Patent abstract:
A method of operating a system comprising multiple processor tiles divided into a plurality of domains, wherein within each domain the tiles are connected to one another via a respective instance of a time-deterministic interconnect, and between domains the tiles are connected to one another via a non-time-deterministic interconnect. The method comprises: performing a compute stage; then performing a respective internal barrier synchronization within each domain; then performing an internal exchange phase within each domain; then performing an external barrier synchronization to synchronize between the different domains; and then performing an external exchange phase between the domains.
Publication number: FR3072797A1
Application number: FR1859638
Filing date: 2018-10-18
Publication date: 2019-04-26
Inventors: Daniel John Pelham WILKINSON; Stephen Felix; Richard Luke Southwell Osborne; Simon Christian Knowles; Alan Graham Alexander; Ian James Quinn
Applicant: Graphcore Ltd
IPC primary class:
Patent description:

DESCRIPTION
TITLE: SYNCHRONIZATION IN A MULTI-TILE, MULTI-CHIP PROCESSING ARRANGEMENT
Technical Field This disclosure relates to synchronizing the workloads of multiple different tiles in a processor comprising a multi-tile processing arrangement, each tile comprising its own processing unit and its own memory. In particular, the disclosure relates to bulk synchronous parallel (BSP) communication schemes in which each tile in a group of tiles must complete a compute phase before any of the tiles in the group can proceed to an exchange phase.
BACKGROUND ART A multi-threaded processor is a processor capable of executing multiple program threads alongside one another. The processor may comprise some hardware that is common to the multiple different threads (e.g. an instruction memory, a data memory, and/or a common execution pipeline); but to support multi-threading, the processor also comprises some dedicated hardware specific to each thread.
The dedicated hardware comprises at least a respective bank of context registers for each of the number of threads that can be executed at once. A "context", when talking about multi-threaded processors, refers to the program state of a respective one of the threads being executed alongside one another (e.g. its program counter value, status, and current operand values). The context register bank refers to the respective collection of registers for representing this program state of the respective thread. Registers in a register bank are distinct from general-purpose memory in that register addresses are fixed as bits in instruction words, whereas memory addresses can be computed by executing instructions. The registers of a given context typically comprise a respective program counter for the respective thread, and a respective set of operand registers for temporarily holding the data acted upon and output by the respective thread during the computations it performs. Each context may also have a respective status register for storing a status of the respective thread (e.g. whether it is paused or running). Thus each of the currently running threads has its own separate program counter, and optionally operand registers and one or more status registers.
One possible form of multi-threaded operation is parallelism. That is, as well as multiple contexts, multiple execution pipelines are provided: i.e. a separate execution pipeline for each stream of instructions to be executed in parallel. However, this requires a great deal of duplication in terms of hardware.
Instead, therefore, another form of multi-threaded processor employs concurrency rather than parallelism, whereby the threads share a common execution pipeline (or at least a common part of a pipeline) and different threads are interleaved through this same, shared execution pipeline. Performance of a multi-threaded processor may still be improved compared to no concurrency or parallelism, thanks to increased opportunities for hiding pipeline latency. Also, this approach does not require as much extra hardware dedicated to each thread as a fully parallel processor with multiple execution pipelines, and so does not incur as much extra silicon.
A form of parallelism can be achieved by means of a processor comprising an arrangement of multiple tiles on the same chip (i.e. the same die), each tile respectively comprising its own separate processing unit and its own memory (including program memory and data memory). Thus separate portions of program code can be run in parallel on different tiles. The tiles are connected together via an on-chip interconnect which enables the code run on the different tiles to communicate between tiles. In some cases the processing unit on each tile may itself run multiple concurrent threads on-tile, each tile having its own respective set of contexts and corresponding pipeline as described above, in order to support interleaving of multiple threads on the same tile through the same pipeline.
In general, there may exist dependencies between the portions of a program running on different tiles. A technique is therefore required to prevent a piece of code on one tile running ahead of data upon which it is dependent being made available by another piece of code on another tile. There are a number of possible schemes for achieving this, but the scheme of interest here is known as "bulk synchronous parallel" (BSP). According to BSP, each tile performs a compute phase and an exchange phase in an alternating cycle. During the compute phase each tile performs one or more computation tasks locally on-tile, but does not communicate any results of its computations to any of the other tiles. In the exchange phase, each tile is allowed to exchange one or more results of the computations from the preceding compute phase with one or more others of the tiles in the group, but does not yet proceed to the next compute phase. Further, according to the BSP principle, a barrier synchronization is placed at the juncture transitioning from the compute phase into the exchange phase, or at the transition from the exchange phase into the compute phase, or both. That is to say, either: (a) all tiles are required to complete their respective compute phases before any in the group is allowed to proceed to the next exchange phase, or (b) all tiles in the group are required to complete their respective exchange phases before any tile in the group is allowed to proceed to the next compute phase, or (c) both. In some scenarios a tile performing computation may be allowed to communicate with other system resources such as a network card or storage disk, as long as no communication with other tiles in the group is involved.
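For illustration only, the following is a minimal Python sketch of one such BSP "superstep" with a barrier on both transitions (case (c) above); threads stand in for tiles and a shared array stands in for the interconnect:

```python
import threading

NUM_TILES = 4
barrier = threading.Barrier(NUM_TILES)        # the BSP barrier synchronization
inbox = [0] * NUM_TILES                       # stand-in exchange medium

def tile_program(tile_id, supersteps):
    state = tile_id
    for _ in range(supersteps):
        result = state + 1                    # compute phase: local work only
        barrier.wait()                        # barrier: compute -> exchange
        inbox[(tile_id + 1) % NUM_TILES] = result   # exchange phase
        barrier.wait()                        # barrier: exchange -> compute
        state = inbox[tile_id]                # safe: all writes happened before the barrier

threads = [threading.Thread(target=tile_program, args=(i, 3)) for i in range(NUM_TILES)]
for t in threads: t.start()
for t in threads: t.join()
```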
In an interconnected system of senders and receivers that may also have compute tasks to perform in between communicating with one another, there are essentially three ways the senders and receivers can implement this. The first is the "rendezvous" approach. According to this approach, the sender signals the receiver when it is ready to send data, and the receiver signals the sender when it is ready to receive data. If the sender has data ready to send but the receiver is performing some other compute task, then the sender must either wait for the receiver to finish its compute task and signal its readiness to receive data from the sender, or else must interrupt the receiver. Similarly, if the receiver requires data from the sender while the sender is still performing some other compute task, then the receiver must either wait for the sender to finish its compute task and signal its readiness to send data to the receiver, or else must interrupt the sender. The rendezvous approach has the advantage that it does not require queues to buffer the transmitted data, since the communication of the data only proceeds once both the sender and the receiver have agreed that they are ready to communicate. However, the downside is latency: the senders and receivers will spend a lot of time waiting for one another, or else will have to perform a lot of interrupts, both of which incur a latency penalty. The latency ultimately manifests itself as reduced overall throughput.
The second possibility is the "mailbox" approach. According to this approach, the sender does not wait for the receiver before sending its data. Instead the transmitted data is buffered in a queue, from which the receiver reads the data when it is ready. As long as the queues are long enough, this solves the latency problem of the rendezvous approach. However, if a queue fills up, the process stalls and in effect the communications fall back to the rendezvous approach. To reduce the chance of this, the queues need to be made long relative to the amount of data expected to be transmitted. But queues incur a significant silicon footprint, especially in an array of many potential combinations of senders and receivers. Also, in practice the queues cannot be made indefinitely long.
Bulk synchronous parallel (BSP) provides a third way: each tile performs a certain defined amount of compute processing in a compute phase, then all the tiles synchronize together (the barrier synchronization) before advancing to an exchange phase. This does not incur as much latency as the rendezvous approach, and does not incur as much queuing as the mailbox approach.
An example use of multi-threaded and/or multi-tile processing is found in artificial intelligence. As will be familiar to those skilled in the art of artificial intelligence, an artificial intelligence algorithm is based around performing iterative updates to a "knowledge model", which can be represented by a graph of multiple interconnected nodes. Each node represents a function of its inputs. Some nodes receive the inputs to the graph and some receive inputs from one or more other nodes, whilst the output of some nodes forms the inputs of other nodes, and the output of some nodes provides the output of the graph (and in some cases a given node may even have all of these: inputs to the graph, outputs from the graph, and connections to other nodes). Further, the function at each node is parameterized by one or more respective parameters, i.e. weights. During a learning stage the aim is, based on a set of experiential input data, to find values for the various parameters such that the graph as a whole will generate a desired output for a range of possible inputs. Various algorithms for doing this are known in the art, such as a back-propagation algorithm based on stochastic gradient descent. Over multiple iterations based on the input data, the parameters are gradually tuned to decrease their errors, and thus the graph converges toward a solution. In a subsequent stage, the learned model can then be used to make predictions of outputs given a specified set of inputs, or to make inferences as to inputs (causes) given a specified set of outputs.
The implementation of each node will involve the processing of data, and the interconnections of the graph correspond to data to be exchanged between the nodes. Typically, at least some of the processing of each node can be carried out independently of some or all of the other nodes in the graph, and therefore large graphs expose great opportunities for concurrency and/or parallelism.
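As a toy illustration of this structure (the function, names and values here are purely illustrative), each node can be modelled as a parameterized function of its inputs, and nodes with no mutual dependency can be evaluated concurrently, e.g. on different tiles:

```python
def node(inputs, weights):
    # each node computes a function of its inputs, parameterized by weights
    return sum(w * x for w, x in zip(weights, inputs))   # a simple weighted sum

graph_inputs = [1.0, 2.0, 3.0]
layer_weights = [[0.1, 0.2, 0.3],      # parameters (weights) of node A
                 [0.4, 0.5, 0.6]]      # parameters (weights) of node B
# A and B depend only on the graph inputs, not on each other, so they expose
# exactly the kind of parallelism exploited by a multi-tile processor
layer_outputs = [node(graph_inputs, w) for w in layer_weights]
```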
Summary of the Invention The present invention takes as its starting point the choice of a bulk synchronous parallel (BSP) approach as the basis for communications between tiles.
According to the present invention, it is desired to implement a BSP process across a system comprising multiple processing tiles arranged into different time-deterministic domains, wherein communications between tiles within the same domain are time-deterministic, but communications between tiles in different domains are non-time-deterministic. In such scenarios, the inventors have identified that it would be desirable to separate the BSP process into time-deterministic and non-time-deterministic stages, so as to prevent the time determinism of at least some of the time-deterministic exchanges within the time-deterministic domains being contaminated by the non-time-deterministic exchanges between such domains.
According to one aspect disclosed herein, there is provided a method of operating a system comprising multiple processor tiles divided into a plurality of domains, wherein within each domain the tiles are connected to one another via a respective instance of a time-deterministic interconnect, and between domains the tiles are connected to one another via a non-time-deterministic interconnect; the method comprising:
on each respective tile of a participating group of some or all of the tiles across the domains, performing a compute stage in which the respective tile performs one or more respective on-tile computations, but communicates computation results neither to nor from any others of the tiles in the group;
within each respective one of said one or more domains, performing a respective internal barrier synchronization to require that all the participating tiles in the respective domain have completed the compute phase before any of the participating tiles in the respective domain is allowed to proceed to an internal exchange phase, thereby establishing a common time reference between all the participating tiles internally within each individual one of said one or more domains;
following the respective internal barrier synchronization, performing the internal exchange phase within each of said one or more domains, in which each participating tile within the respective domain communicates one or more results of its respective computations to and/or from one or more others of the participating tiles within the same domain via the time-deterministic interconnect, but communicates computation results neither to nor from any of the other domains; performing an external barrier synchronization to require that all the participating tiles of said domains have completed their internal exchange phase before any of the participating tiles is allowed to proceed to an external exchange phase, thereby establishing a common time reference across all the participating tiles across the domains; and following the external barrier synchronization, performing the external exchange phase in which one or more of the participating tiles communicate one or more of the computation results with another of the domains via the non-time-deterministic interconnect.
That is to say, first one or more internal BSP stages are performed, whereby tiles within the same time-deterministic domain synchronize and exchange data with one another, but are not required to synchronize with any entities in different time-deterministic domains, and do not exchange data between these domains. Then a separate, external BSP stage is performed, in which all the tiles across the wider, non-time-deterministic realm synchronize in a "global" synchronization, and then exchange data between the domains.
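The following Python sketch models this claimed sequence under purely illustrative assumptions (two domains of four tiles each, threads standing in for tiles, and the exchange phases indicated by comments); it is not the actual hardware mechanism, which is described in detail later:

```python
import threading

NUM_DOMAINS, TILES_PER_DOMAIN = 2, 4
internal = [threading.Barrier(TILES_PER_DOMAIN) for _ in range(NUM_DOMAINS)]
external = threading.Barrier(NUM_DOMAINS * TILES_PER_DOMAIN)

def tile_program(domain, tile_id):
    result = domain * 10 + tile_id     # compute phase: on-tile work only
    internal[domain].wait()            # internal barrier: this domain only
    # internal exchange phase: communicate within the domain, over the
    # time-deterministic interconnect
    external.wait()                    # external barrier: all domains together
    # external exchange phase: communicate between domains, over the
    # non-time-deterministic interconnect

threads = [threading.Thread(target=tile_program, args=(d, t))
           for d in range(NUM_DOMAINS) for t in range(TILES_PER_DOMAIN)]
for th in threads: th.start()
for th in threads: th.join()
```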
One reason why time determinism is desirable is that it allows communications between tiles in the same domain to be conducted without incurring the silicon footprint of queues in the respective interconnect. Hence, in embodiments, communications over the non-time-deterministic interconnect are queued, but communications between tiles over the time-deterministic interconnect are not queued.
In embodiments, over the time-deterministic interconnect, the communication between each pair of sending and receiving tiles is performed by: transmitting a message from the sending tile, and controlling the receiving tile to listen to the address of the sending tile at a predetermined time interval after the transmission by the sending tile, wherein the predetermined time interval is equal to a total predetermined delay between the sending tile and the receiving tile, the time interval being set by a compiler having predetermined information on the delay.
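A minimal sketch of this queue-free, compiler-scheduled exchange, with purely illustrative cycle counts and data structures:

```python
DELAY = 3                                # predetermined sender-to-receiver delay (cycles)

interconnect = {}                        # models the pipelined wires: cycle -> datum

def send(send_cycle, value):
    interconnect[send_cycle + DELAY] = value   # arrives exactly DELAY cycles later

def listen(send_cycle):
    # the receiver is controlled to listen at the precomputed arrival time,
    # so no queue is needed to hold data in the interconnect
    return interconnect[send_cycle + DELAY]

send(100, "payload")
assert listen(100) == "payload"
```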
Another reason for dividing up the time-deterministic and non-time-deterministic realms is that time determinism typically implies a lossless medium at the physical level, but on the other hand this may be impractical to extend across an indefinite range of tiles. Hence, here again, it would be desirable to allow time determinism within certain domains whilst preventing wide-scale non-time-deterministic communications contaminating a time-deterministic exchange phase.
Hence in embodiments, the time-deterministic interconnect is lossless, whilst the non-time-deterministic interconnect is lossy at the level of a physical layer, transport layer, or network layer.
Another reason for making a division between the time-deterministic realm and the non-time-deterministic realm is that, in embodiments, a time-deterministic interconnect is provided for exchanging data internally on a chip, but it is less practical to make communications between chips time-deterministic. Hence in embodiments, each of the domains may be a different respective chip, with the time-deterministic interconnect being an internal, on-chip interconnect and the non-time-deterministic interconnect being an external interconnect between the chips.
A greater penalty is incurred in exchanging data between chips than for internal communications between tiles on the same chip. External communication experiences longer latency and greater uncertainty compared to internal communication, since it is less local. Connections between chips tend to have a lower wiring density, due to limitations imposed by the package, and therefore less available data bandwidth. Also, the wires reach longer distances and are therefore more capacitive, and more vulnerable to noise (which can cause losses and therefore the need for retransmissions at the physical layer). Furthermore, as well as the greater physical distance, data transfers between chips typically traverse a greater amount of logic, such as SerDes (serializer-deserializer) logic and flow-control mechanisms, all of which add extra delay relative to internal communications.
By keeping the internal (on-chip) and external (off-chip) BSP stages separate, some of the exchanges between tiles on the same chip are prevented from being contaminated by the latency of a global exchange, which is a more latency-costly operation. If every BSP stage instead involved global synchronization and exchange, this could result in a significantly slower program.
The different chips could be different dies in the same integrated circuit (IC) package, or different dies in different IC packages, or a mixture of these.
It is also noted that, more generally, it is not excluded that the division between the time-deterministic and non-time-deterministic realms could be drawn elsewhere than at the on-chip/off-chip boundary. For example, a time-deterministic interconnect could be provided for exchanging data between subgroups of multiple chips, or alternatively different time-deterministic domains that are asynchronous with respect to one another could be formed on the same chip.
Hence in embodiments, each of the domains may comprise multiple chips, with the time-deterministic interconnect being an external, lossless inter-chip interconnect and the non-time-deterministic interconnect being an external, lossy interconnect.
In embodiments, the method may comprise performing a series of repeating iterations, each comprising a respective instance of the compute stage, followed by a respective instance of the internal barrier synchronization, followed by a respective instance of the internal exchange phase, followed by a respective instance of the external barrier synchronization, followed by a respective instance of the external exchange phase; wherein each successive iteration is not allowed to proceed until the external barrier synchronization of the immediately preceding iteration has been performed.
In embodiments, the method may comprise performing a sequence of instances of the compute phase, each followed by a corresponding instance of the internal exchange phase and then a corresponding instance of the internal barrier synchronization, with the external barrier synchronization following the last compute phase in said sequence.
In embodiments, each of said one or more iterations may comprise a respective sequence of multiple instances of the compute phase, each followed by a corresponding instance of the internal exchange phase and then a corresponding instance of the internal barrier synchronization, with the respective external barrier synchronization following the last instance of the compute phase in the respective sequence.
In embodiments, each of the internal and external barrier synchronizations may be performed by executing a synchronization instruction comprising an opcode and an operand, the operand specifying a mode of the synchronization instruction as either internal or external, and the opcode, when executed, causing hardware logic in the time-deterministic interconnect to coordinate the performance of the internal barrier synchronization when the operand specifies the internal mode, and causing hardware logic in the non-time-deterministic interconnect to coordinate the performance of the external barrier synchronization when the operand specifies the external mode.
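A toy software model of such a synchronization instruction is given below; the mode encoding and the names of the hardware hooks are assumptions for illustration, not the real encoding:

```python
from enum import Enum

class SyncMode(Enum):
    INTERNAL = 0   # barrier coordinated by the time-deterministic interconnect
    EXTERNAL = 1   # barrier coordinated by the non-time-deterministic interconnect

def internal_barrier():   # stand-in for the on-chip sync hardware
    pass

def external_barrier():   # stand-in for the inter-chip sync hardware
    pass

def sync(mode: SyncMode):
    # one opcode; the operand selects which interconnect's logic coordinates it
    if mode is SyncMode.INTERNAL:
        internal_barrier()
    else:
        external_barrier()

sync(SyncMode.INTERNAL)   # e.g. barrier among tiles in the same domain
sync(SyncMode.EXTERNAL)   # e.g. barrier across domains
```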
In embodiments, the method may comprise selecting one of a plurality of predefined zones as the participating tiles, each zone comprising a different set or subset of the multiple domains.
In embodiments, the zones are hierarchical, with at least two lower-level zones nested within at least one higher-level zone.
In embodiments, the operand of the synchronization instruction may specify which of a plurality of different possible variants of the external mode the external barrier synchronization applies to, each corresponding to a different one of said zones.
In embodiments, the variants of the external mode specify at least which hierarchical level of zone the external barrier synchronization applies to.
In embodiments, the external synchronization and exchange may comprise: first performing a first-level external synchronization, and then exchange, constrained within a first, lower-level one of the hierarchical zones; and following the first-level synchronization and exchange, performing a second-level external synchronization and exchange across a second, higher-level one of said zones.
In embodiments, one of the hierarchical zones may consist of all the tiles in the same integrated circuit (IC) package, but no further; and/or one of the hierarchical zones may consist of all the tiles on the same card, but no further; and/or one of the hierarchical zones may consist of all the tiles in the same chassis, but no further.
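For example, nested zones for a single chassis could be modelled as follows (the sizes are illustrative, and in practice each card and each chip would have its own barrier instance rather than the single one per level shown here):

```python
import threading

TILES_PER_CHIP, CHIPS_PER_CARD, CARDS_PER_CHASSIS = 4, 2, 2

barriers = {
    "chip":    threading.Barrier(TILES_PER_CHIP),
    "card":    threading.Barrier(TILES_PER_CHIP * CHIPS_PER_CARD),
    "chassis": threading.Barrier(TILES_PER_CHIP * CHIPS_PER_CARD * CARDS_PER_CHASSIS),
}

def sync_external(zone):
    # the external-mode variant of the sync instruction selects how far the
    # barrier reaches: e.g. first "chip" level, then "card", then "chassis",
    # matching the first-level/second-level sequence described above
    barriers[zone].wait()
```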
In embodiments, the method may comprise executing an abstain instruction on one or some of the tiles, the opcode of the abstain instruction causing the tile or tiles on which it is executed to be opted out of the group.
In embodiments, in the external exchange phase, one or more of the participating tiles may also communicate one or more of the computation results with a host processor via the external interconnect, the host processor being implemented on a separate host processor chip.
In embodiments, in the compute phase, some or all of the participating tiles may each run a batch of worker threads in an interleaved manner, and the internal barrier synchronization may require that all the worker threads in all the batches have exited.
In embodiments, the method may comprise using the system to perform an artificial intelligence algorithm in which each node in a graph has one or more respective input vertices and one or more respective output vertices, the input vertices of at least some of the nodes being the output vertices of at least some others of the nodes, each node comprising a respective function relating its output vertices to its input vertices, each respective function being parameterized by one or more respective parameters, and each of the respective parameters having an associated error, such that the graph converges toward a solution as the errors in some or all of the parameters reduce; wherein each of the tiles may be used to model a respective one or more of the nodes in the graph.
In embodiments, the chips are AI accelerator chips assisting the host processor.
In alternative aspects of this disclosure, the primary division between the different levels of BSP could be made between the on-chip and off-chip realms, rather than necessarily dividing the BSP process along time-deterministic and non-time-deterministic lines. It is not excluded that the internal and external communications could both be made time-deterministic, or that neither is, or that the division between time-deterministic realms is not drawn specifically according to whether tiles are on the same chip or on different chips. In such cases, the separation between internal (on-chip) and external (off-chip) BSP stages will still be advantageous with regard to the latency considerations described above.
Hence, according to another aspect disclosed herein, there is provided a method of operating a system comprising multiple processor chips connected together via an external interconnect, at least one of the chips comprising an array of processor tiles connected together via an internal interconnect; the method comprising: on each respective tile of a participating group of some or all of the tiles spanning one or more of the chips, performing a compute stage in which the respective tile performs one or more respective on-tile computations, but communicates computation results neither to nor from any others of the tiles in the group;
on each respective one of said one or more chips, performing a respective internal barrier synchronization to require that all the participating tiles on the respective chip have completed the compute phase before any of the participating tiles on the respective chip is allowed to proceed to an internal exchange phase;
following the respective internal barrier synchronization, performing the internal exchange phase on each of said one or more chips, in which each participating tile on the respective chip communicates one or more results of its respective computations to and/or from one or more others of the participating tiles on the same chip via the internal interconnect, but communicates computation results neither to nor from any of the other chips;
performing an external barrier synchronization to require that all the participating tiles on said one or more chips have completed their internal exchange phase before any of the tiles in the group is allowed to proceed to an external exchange phase; and following the external barrier synchronization, performing an external exchange phase within a participating set of the chips, in which one or more of the tiles of the group communicate one or more of the computation results with another of the participating chips via the external interconnect.
According to another aspect disclosed herein, there is provided a computer program product embodied on a computer-readable medium, comprising code arranged such that, when executed on the tiles, it performs operations in accordance with any of the methods disclosed herein.
According to another aspect disclosed herein, there is provided a system comprising multiple processor tiles divided into a plurality of domains, wherein within each domain the tiles are connected to one another via a respective instance of a time-deterministic interconnect, and between domains the tiles are connected to one another via a non-time-deterministic interconnect; the system being programmed to perform the following operations: on each respective tile of a participating group of some or all of the tiles across the domains, performing a compute stage in which the respective tile performs one or more respective on-tile computations, but communicates computation results neither to nor from any others of the tiles in the group;
within each respective one of said one or more domains, performing a respective internal barrier synchronization to require that all the participating tiles in the respective domain have completed the compute phase before any of the participating tiles in the respective domain is allowed to proceed to an internal exchange phase, thereby establishing a common time reference between all the participating tiles internally within each individual one of said one or more domains;
following the respective internal barrier synchronization, performing the internal exchange phase within each of said one or more domains, in which each participating tile within the respective domain communicates one or more results of its respective computations to and/or from one or more others of the participating tiles within the same domain via the time-deterministic interconnect, but communicates computation results neither to nor from any of the other domains; performing an external barrier synchronization to require that all the participating tiles of said domains have completed their internal exchange phase before any of the participating tiles is allowed to proceed to an external exchange phase, thereby establishing a common time reference across all the participating tiles across the domains; and following the external barrier synchronization, performing the external exchange phase in which one or more of the participating tiles communicate one or more of the computation results with another of the domains via the non-time-deterministic interconnect.
BRIEF DESCRIPTION OF THE DRAWINGS To aid understanding of the present description and to show how embodiments may be put into effect, reference will be made, by way of example, to the accompanying drawings, in which:
[Fig. 1] Figure 1 is a block diagram of a multi-threaded processing unit;
[Fig. 2] Figure 2 is a block diagram of a plurality of thread contexts;
[Fig. 3] Figure 3 illustrates a scheme of interleaved execution time slots;
[Fig. 4] Figure 4 illustrates a supervisor thread and a plurality of worker threads;
[Fig. 5] Figure 5 is a logic block diagram for aggregating the exit states of multiple threads;
[Fig. 6] Figure 6 schematically illustrates synchronization between worker threads on the same tile;
[Fig. 7] Figure 7 is a block diagram of a processor chip comprising multiple tiles;
[Fig. 8] Figure 8 is a schematic illustration of a bulk synchronous parallel (BSP) computing model;
[Fig. 9] Figure 9 is another schematic illustration of a BSP model;
[Fig. 10] Figure 10 is a schematic illustration of BSP between multi-threaded processing units;
[Fig. 11] Figure 11 is a block diagram of an interconnect system;
[Fig. 12] Figure 12 is a schematic illustration of a system of multiple interconnected processor chips;
[Fig. 13] Figure 13 is a schematic illustration of a multilevel BSP scheme;
[Fig. 14] Figure 14 is another schematic illustration of a system of multiple processor chips;
[Fig. 15] Figure 15 is a schematic illustration of a graph used in an artificial intelligence algorithm;
[Fig. 16] Figure 16 schematically illustrates an arrangement for exchanging data between tiles;
[Fig. 17] Figure 17 illustrates a timing diagram of exchanges;
[Fig. 18] Figure 18 illustrates an example of wiring for synchronization between chips; and [Fig. 19] Figure 19 schematically illustrates an external flow control mechanism for exchanges between chips.
Detailed description of embodiments The following describes components of a processor having an architecture developed to address issues arising in the computations involved in artificial intelligence applications. The processor described here may be used as a work accelerator; that is, it receives a workload from an application running on a host computer, the workload generally being in the form of very large data sets to be processed (such as the large experiential data sets used by an artificial intelligence algorithm to learn a knowledge model, or the data from which to perform a prediction or an inference using a previously learned knowledge model). An aim of the architecture presented here is to process these very large data sets highly efficiently. The processor architecture was developed for processing the workloads involved in artificial intelligence. Nonetheless, it will be apparent that the disclosed architecture may also be suitable for other workloads sharing similar characteristics.
Figure 1 illustrates an example of a processor module 4 in accordance with embodiments of the present disclosure. For instance, the processor module 4 may be one tile of an array of like processor tiles on the same chip, or may be implemented as a stand-alone processor on its own chip. The processor module 4 comprises a multi-threaded processing unit 10 in the form of a barrel-threaded processing unit, and a local memory 11 (i.e. on the same tile in the case of a multi-tile array, or on the same chip in the case of a single-processor chip). A barrel-threaded processing unit is a type of multi-threaded processing unit in which the execution time of the pipeline is divided into a repeating sequence of interleaved time slots, each of which can be owned by a given thread. This will be described in more detail shortly. The memory 11 comprises an instruction memory 12 and a data memory 22 (which may be implemented in different addressable memory modules or in different regions of the same addressable memory module). The instruction memory 12 stores machine code to be executed by the processing unit 10, whilst the data memory 22 stores both data to be operated on by the executed code and output data produced by the executed code (e.g. as a result of such operations).
The memory 12 stores a number of different threads of a program, each thread comprising a respective sequence of instructions for performing a certain task or tasks. Note that an instruction as referred to herein means a machine code instruction, i.e. an instance of one of the fundamental instructions of the processor's instruction set, consisting of a single opcode and zero or more operands.
The program described here comprises a plurality of worker threads, and a supervisor subprogram which may be structured as one or more supervisor threads. This will be described in more detail shortly. In embodiments, each of some or all of the worker threads takes the form of a respective "codelet". A codelet is a particular type of thread, sometimes also referred to as an "atomic" thread. It has all the input information it needs to execute from the beginning of the thread (from the time of being launched), i.e. it does not take any input from any other part of the program or from memory after being launched. Further, no other part of the program will use any outputs (results) of the thread until it has terminated (finished). Unless it encounters an error, it is guaranteed to finish. It is noted that some literature also defines a codelet as being stateless, i.e. if run twice it could not inherit any information from its first run, but that additional definition is not adopted here. It is also noted that not all of the worker threads need be (atomic) codelets, and in some embodiments some or all of the workers may instead be able to communicate with one another.
Within the processing unit 10, multiple different ones of the threads from the instruction memory 12 can be interleaved through a single execution pipeline 13 (though typically only a subset of all the threads stored in the instruction memory can be interleaved at any given point in the overall program). The multi-threaded processing unit 10 comprises: a plurality of context register banks 26, each arranged to represent the state (context) of a different respective one of the threads to be executed concurrently; a shared execution pipeline 13 that is common to the concurrently executed threads; and a scheduler 24 for scheduling the concurrent threads for execution through the shared pipeline in an interleaved manner, preferably in a round-robin manner. The processing unit 10 is connected to a shared instruction memory 12 common to the plurality of threads, and to a shared data memory 22 that is again common to the plurality of threads.
The execution pipeline 13 comprises a fetch stage 14, a decode stage 16, and an execution stage 18 comprising an execution unit which may perform arithmetic and logical operations, load and store operations, and other operations, as defined by the instruction set architecture. Each of the context register banks comprises a respective set of registers for representing the program state of a respective thread.
An example of the registers making up each of the context register banks 26 comprises a respective one or more control registers 28, including at least a program counter (PC) for the respective thread (for keeping track of the instruction address at which the thread is currently executing), and in embodiments also a set of one or more status registers (SR) recording a current status of the respective thread (such as whether it is currently running or paused, e.g. because it has encountered an error). Each of the context register banks 26 also comprises a respective set of operand registers (OP) 32, for temporarily holding data operated upon or resulting from operations defined by the opcodes of the instructions of the respective thread. It will be appreciated that each of the context register banks may optionally comprise one or more other types of respective register (not shown).
It is also noted that whilst the term "register bank" is sometimes used to refer to a group of registers in a common address space, this does not necessarily have to be the case in the present disclosure, and each of the hardware contexts 26 (each of the register sets 26 representing each context) may more generally comprise one or more such register banks.
As will be described in more detail below, the disclosed arrangement has one worker context register bank CX0...CX(M-1) for each of the M threads that can be executed concurrently (M=3 in the example illustrated, but this is not limiting), and one additional supervisor context register bank CXS. The worker context register banks are reserved for storing the contexts of worker threads, and the supervisor context register bank is reserved for storing the context of a supervisor thread. Note that in embodiments the supervisor context is special, in that it comprises a different number of registers compared to that of the workers. Each of the worker contexts preferably has the same number of status registers and operand registers as one another. In embodiments, the supervisor context may comprise fewer operand registers than each of the workers. Examples of operand registers the worker context may comprise that the supervisor does not include: floating-point registers, accumulator registers, and/or dedicated weight registers (for holding the weights of a neural network). In embodiments, the supervisor may also comprise a different number of status registers. Further, in embodiments, the instruction set architecture of the processor module 4 may be configured such that the worker threads and the supervisor thread(s) execute some different types of instruction but also share some instruction types.
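A sketch of this context register arrangement, with illustrative register counts, might look as follows:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBank:
    pc: int = 0                                               # program counter (PC)
    status: list = field(default_factory=lambda: [0] * 2)     # status registers (SR)
    operands: list = field(default_factory=lambda: [0] * 32)  # operand registers (OP)

M = 3                                        # concurrent worker slots (illustrative)
worker_contexts = [ContextBank() for _ in range(M)]   # CX0..CX(M-1)
supervisor_context = ContextBank(operands=[0] * 8)    # CXS: fewer operand registers,
                                                      # per the text (count is assumed)
```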
The fetch stage 14 is connected so as to fetch instructions to be executed from the instruction memory 12, under control of the scheduler 24. The scheduler 24 is configured to control the fetch stage 14 to fetch an instruction from each of a set of concurrently executing threads in turn, in a repeating sequence of time slots, thus dividing the resources of the pipeline 13 into a plurality of temporally interleaved time slots, as will be described in more detail shortly. For example, the scheduling scheme could be round-robin or weighted round-robin. Another term for a processor operating in such a manner is a barrel-threaded processor.
In some embodiments, the scheduler 24 may have access to one of the status registers SR of each thread indicating whether the thread is paused, so that the scheduler 24 in fact controls the fetch stage 14 to fetch the instructions of only those of the threads that are currently active. In embodiments, preferably each time slot (and the corresponding context register bank) is always owned by one thread or another, i.e. each slot is always occupied by some thread, and each slot is always included in the sequence of the scheduler 24; though the thread occupying any given slot may happen to be paused at the time, in which case, when the sequence comes around to that slot, the instruction fetch for the respective thread is passed over. Alternatively it is not excluded, for example, that in less preferred variant implementations some slots can be temporarily vacant and excluded from the scheduled sequence. Where reference is made to the number of time slots the execution unit is operable to interleave, or suchlike, this refers to the maximum number of slots the execution unit is capable of executing concurrently, i.e. the number of concurrent slots the execution unit's hardware supports.
The fetch stage 14 has access to the program counter (PC) of each of the contexts. For each respective thread, the fetch stage 14 fetches the next instruction of that thread from the next address in the program memory 12 as indicated by the program counter. The program counter increments each execution cycle unless branched by a branch instruction. The fetch stage 14 then passes the fetched instruction to the decode stage 16 to be decoded, and the decode stage 16 then passes an indication of the decoded instruction to the execution unit 18 along with the decoded addresses of any operand registers 32 specified in the instruction, in order for the instruction to be executed. The execution unit 18 has access to the operand registers 32 and the control registers 28, which it may use in executing the instruction based on the decoded register addresses, such as in the case of an arithmetic instruction (e.g. by adding, multiplying, subtracting or dividing the values in two operand registers and outputting the result to another operand register of the respective thread). Or, if the instruction defines a memory access (load or store), the load/store logic of the execution unit 18 loads a value from the data memory into an operand register of the respective thread, or stores a value from an operand register of the respective thread into the data memory 22, in accordance with the instruction. Or, if the instruction defines a branch or a status change, the execution unit changes the value in the program counter PC or one of the status registers SR accordingly. Note that while one thread's instruction is being executed by the execution unit 18, an instruction from the thread in the next time slot in the interleaved sequence can be being decoded by the decode stage 16; and/or while one instruction is being decoded by the decode stage 16, the instruction from the thread in the next time slot after that can be being fetched by the fetch stage 14 (though in general the scope of the disclosure is not limited to one instruction per time slot, e.g. in alternative scenarios a batch of two or more instructions could be issued from a given thread per time slot). Thus the interleaving advantageously hides latency in the pipeline 13, in accordance with known barrel-threading techniques.
An example of the interleaving scheme implemented by the scheduler 24 is illustrated in Figure 3. Here the concurrent threads are interleaved according to a round-robin scheme whereby, within each round of the scheme, the round is divided into a sequence of time slots S0, S1, S2..., each for executing a respective thread. Typically each slot is one processor cycle long and the different slots are evenly sized, though this is not necessarily so in all possible embodiments; e.g. a weighted round-robin scheme is also possible whereby some threads get more cycles than others per execution round. In general, the barrel-threading may employ either an even round-robin or a weighted round-robin schedule, where in the latter case the weighting may be fixed or adaptive.
Whatever the sequence per execution round, this pattern then repeats, each round comprising a respective instance of each of the time slots. Note therefore that a time slot as referred to herein means the repeating allocated place in the sequence, not a particular instance of the slot in a given repetition of the sequence. Put another way, the scheduler 24 apportions the execution cycles of the pipeline 13 into a plurality of temporally interleaved (time-division multiplexed) execution channels, each comprising a recurrence of a respective time slot in a repeating sequence of time slots. In the illustrated embodiment there are four time slots, but this is just for illustrative purposes and other numbers are possible. For example, in one preferred embodiment there are in fact six time slots.
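The repeating slot sequence, including the passing-over of paused threads described above, can be modelled as follows (the slot count and paused set are illustrative):

```python
import itertools

M = 4                                   # number of interleaved time slots
paused = {1}                            # e.g. the thread in slot S1 is paused

def issue_sequence(rounds):
    issued = []
    for slot in itertools.islice(itertools.cycle(range(M)), rounds * M):
        if slot not in paused:          # paused slots stay in the sequence,
            issued.append(slot)         # but their instruction fetch is skipped
    return issued

print(issue_sequence(2))                # -> [0, 2, 3, 0, 2, 3]
```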
Whatever the number of time slots the round-robin scheme is divided into, then according to the present disclosure, the processing unit 10 comprises one more context register bank 26 than there are time slots, i.e. it supports one more context than the number of interleaved time slots it is capable of barrel-threading.
This is illustrated by way of example in Figure 2: if there are four time slots S0...S3 as shown in Figure 3, then there are five context register banks, labelled here CX0, CX1, CX2, CX3 and CXS. That is, even though there are only four execution time slots S0...S3 in the barrel-threaded scheme, and so only four threads can be executed concurrently, it is disclosed herein to add a fifth context register bank CXS, comprising a fifth program counter (PC), a fifth set of operand registers 32, and in embodiments also a fifth set of one or more status registers (SR). Note, however, that as mentioned, in embodiments the supervisor context may differ from the others CX0...3, and the supervisor thread may support a different instruction set for operating the execution pipeline 13.
Each of the first four contexts CX0...CX3 is used to represent the state of a respective one of a plurality of worker threads currently assigned to one of the four execution time slots S0...S3, for performing whatever application-specific computation tasks are desired by the programmer (note again that this may be only a subset of the total number of worker threads of the program as stored in the instruction memory 12). The fifth context CXS, however, is reserved for a special function, to represent the state of a "supervisor thread" (SV) whose role it is to coordinate the execution of the worker threads, at least in the sense of assigning which of the worker threads W is to be executed in which of the time slots S0, S1, S2... at what point in the overall program. Optionally, the supervisor thread may have other "overseer" or coordinating responsibilities. For example, the supervisor thread may be responsible for performing barrier synchronizations to ensure a certain order of execution. E.g. in a case where one or more second threads depend on data to be output by one or more first threads run on the same processor module 4, the supervisor may perform a barrier synchronization to ensure that none of the second threads begins until the first threads have finished. In addition or instead, the supervisor may perform a barrier synchronization to ensure that one or more threads on the processor module 4 do not begin until a certain external source of data, such as another tile or processor chip, has completed the processing required to make that data available. The supervisor thread may also be used to perform other functionality relating to the multiple worker threads. For example, the supervisor thread may be responsible for communicating data externally to the processor module 4 (to receive external data to be acted upon by one or more of the threads, and/or to transmit data output by one or more of the worker threads). In general, the supervisor thread may be used to provide any kind of overseeing or coordinating function desired by the programmer. For instance, as another example, the supervisor may oversee transfers between the tile-local memory 12 and one or more resources in the wider system (external to the array 6), such as a storage disk or a network card.
It will of course be appreciated that four time slots is just an example, and in general, in other embodiments there may be other numbers, such that if there is a maximum of M time slots S0...S(M-1) per round, the processor module 4 comprises M+1 contexts CX0...CX(M-1) and CXS, i.e. one for each worker thread that can be interleaved at any given time, and an extra context for the supervisor. For example, in one exemplary implementation there are six time slots and seven contexts.
Referring to Figure 4, the supervisor thread SV does not have its own time slot per se in the scheme of interleaved time slots. Nor do the workers, as the allocation of slots to worker threads is flexibly defined. Rather, each time slot has its own dedicated context register bank (CX0...CX(M-1)) for storing worker context, which is used by the worker when the slot is allocated to the worker, but not used when the slot is allocated to the supervisor. When a given slot is allocated to the supervisor, that slot instead uses the context register bank CXS of the supervisor. Note that the supervisor always has access to its own context, and no worker is able to occupy the supervisor context register bank CXS.
The supervisor thread SV has the ability to run in any and all of the time slots S0...S3 (or more generally S0...S(M-1)). The scheduler 24 is configured so that, when the program as a whole starts, it begins by allocating the supervisor thread to all of the time slots, i.e. the supervisor SV starts out running in all of S0...S3. However, the supervisor thread is provided with a mechanism for, at some subsequent point (either straight away or after performing one or more supervisor tasks), temporarily relinquishing each of the slots in which it is running to a respective one of the worker threads, e.g. initially the workers W0...W3 in the example shown in Figure 4. This is achieved by the supervisor thread executing a relinquish instruction, called "RUN" by way of example herein. In embodiments, this instruction takes two operands: an address of a worker thread in the instruction memory 12 and an address of some data for that worker thread in the data memory 22:
RUN task_addr, data_addr
The worker threads are portions of code that can be run concurrently with one another, each representing one or more respective computation tasks to be performed. The data address may specify some data to be acted upon by the worker thread. Alternatively, the relinquish instruction may take only a single operand specifying the address of the worker thread, and the data address could be included in the code of the worker thread; or, in another example, the single operand could point to a data structure specifying the addresses of the worker thread and the data. As mentioned, in embodiments at least some of the workers may take the form of codelets, i.e. atomic units of concurrently executable code. Alternatively or additionally, some of the workers need not be codelets and may instead be able to communicate with one another.
The relinquish instruction ("RUN") acts on the scheduler 24 so as to relinquish the current time slot, in which this instruction is itself executed, to the worker thread specified by the operand. Note that it is implicit in the relinquish instruction that it is the time slot in which this instruction is executed that is being relinquished (implicit in the context of machine code instructions means that no operand is needed to specify this; it is understood implicitly from the opcode itself). Thus the time slot that is given away is the time slot in which the supervisor executes the relinquish instruction. Put another way, the supervisor is executing in the same space that it gives away. The supervisor says "run this piece of code at this location", and then from that point onwards the recurring slot is (temporarily) owned by the relevant worker thread.
The supervisor thread SV performs a similar operation in each of one or more others of the time slots, to give away some or all of its time slots to different respective ones of the worker threads W0...W3 (selected from a larger set W0...Wj in the instruction memory 12). Once it has done so for the last slot, the supervisor is suspended (it will resume later where it left off when one of the slots is handed back by a worker W).
The supervisor thread SV is thus able to allocate different worker threads, each performing one or more tasks, to different ones of the interleaved execution time slots S0...S3. When the supervisor thread determines that it is time to run a worker thread, it uses the relinquish instruction (RUN) to allocate that worker to the time slot in which the RUN instruction was executed.
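A toy model of the relinquish mechanism, with illustrative addresses (the real instruction acts on the scheduler 24 and the context register banks, not on a Python list):

```python
slots = ["SV"] * 4                     # the supervisor starts out in all slots

def run(current_slot, task_addr, data_addr):
    # the relinquished slot is implicitly the one executing the instruction
    slots[current_slot] = ("worker", task_addr, data_addr)

run(0, task_addr=0x100, data_addr=0x800)
run(1, task_addr=0x140, data_addr=0x900)
print(slots)   # slots 0 and 1 now owned by workers; supervisor keeps 2 and 3
```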
In some embodiments, the instruction set also comprises a variant of the run instruction, RUNALL ("run all"). This instruction is used to launch a set of more than one worker thread together, all executing the same code. In embodiments, this launches a worker in every one of the processing unit's slots S0...S3 (or more generally S0...S(M-1)).
Further, in some embodiments, the RUN and/or RUNALL instruction, when executed, also automatically copies some status from one or more of the supervisor status registers CXS(SR) into a corresponding one or more status registers of the worker thread(s) launched by the RUN or RUNALL. For instance, the copied status may comprise one or more modes, such as a floating-point rounding mode (e.g. round to nearest or round to zero) and/or an overflow mode (e.g. saturate, or use a value representing infinity). The copied status or mode then controls the worker in question to operate in accordance with the copied status or mode. In embodiments, the worker can later overwrite this in its own status register (but cannot change the supervisor's status). In further alternative or additional embodiments, the workers can choose to read some status from one or more status registers of the supervisor (and again may change their own status later). For example, here again this could be to adopt a mode from the supervisor status register, such as a floating-point mode or a rounding mode. In embodiments, however, the supervisor cannot read any of the context registers CX0... of the workers.
Once launched, each of the currently allocated worker threads W0...W3 proceeds to perform the one or more computation tasks defined in the code specified by the respective run instruction. At the end of this, the respective worker thread then hands the time slot in which it is running back to the supervisor thread. This is achieved by executing an exit instruction (EXIT).
The EXIT instruction takes at least one operand, and preferably a single operand, exit_state (e.g. a binary value), to be used for any purpose desired by the programmer to indicate a state of the respective codelet upon its termination (e.g. to indicate whether a certain condition was met):

EXIT exit_state

The EXIT instruction acts on the scheduler 24 so that the time slot in which it is executed is handed back to the supervisor thread. The supervisor thread can then perform one or more subsequent supervisory tasks (e.g. barrier synchronization and/or data exchange with external resources such as other tiles), and/or continue by executing another run instruction to allocate a new worker thread (W4, etc.) to the slot in question. Note again, therefore, that the total number of threads in the instruction memory 12 may be greater than the number that the barrel-threaded processing unit 10 can interleave at any one time. It is the role of the supervisor thread SV to schedule which of the worker threads W0...Wj from the instruction memory 12, at which stage in the overall program, are to be assigned to which of the interleaved time slots S0...SM in the scheduler 24's schedule.
In addition, the EXIT instruction is given a further special function, namely to cause the exit state specified in the operand of the EXIT instruction to be automatically aggregated (by dedicated hardware logic) with the exit states of a plurality of other worker threads executed through the same pipeline 13 of the same processor module 4 (e.g. the same tile). Thus an additional, implicit facility is included in the instruction for terminating a thread.
An example of a circuit for achieving this is shown in FIG. 5. In this example, the exit states of the individual threads and the aggregated exit state each take the form of a single bit, i.e. 0 or 1. The processor module 4 comprises a register 38 for storing the aggregated exit state of that processor module 4. This register may be referred to herein as the local consensus register $LC (as opposed to a global consensus when the processor module 4 is one of an array of similar processor tiles, as will be discussed in more detail shortly).
In embodiments, this local consensus register $LC 38 is one of the supervisor's state registers in the supervisor context register bank CXS. The logic for performing the aggregation comprises an AND gate 37 arranged to compute a logical AND of (A) the exit state specified in the operand of the EXIT instruction and (B) the current value in the local consensus register ($LC) 38, and to write the result (Q) back into the local consensus register $LC 38 as the new value of the local aggregate.
At an appropriate synchronization point in the program, the value stored in the local consensus register ($LC) 38 is initially reset to a value of 1. That is, any threads exiting after this point will contribute to the locally aggregated exit state $LC until the next reset. The output (Q) of the AND gate 37 is 1 if both inputs (A, B) are 1, and otherwise the output Q goes to 0 if either input (A, B) is 0. Each time an EXIT instruction is executed, its exit state is aggregated with those that have arrived previously (since the last reset). Thus, by means of the arrangement shown in FIG. 5, the logic maintains a running aggregate of the exit states of all the worker threads that have terminated by means of an EXIT instruction since the last time the local consensus register ($LC) 38 was reset. In this example, the running aggregate indicates whether all threads so far have exited true: any exit state of 0 from any of the worker threads will cause the aggregate in the register 38 to become latched at 0 until the next reset. In embodiments, the supervisor SV can read the running aggregate at any time by fetching the current value from the local consensus register ($LC) 38 (it does not need to wait for an on-tile synchronization to do so).
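As an illustration of the aggregation behaviour just described, here is a minimal Python sketch of the $LC register and AND gate 37 of FIG. 5, assuming single-bit exit states; the names LocalConsensus, record_exit and reset are assumptions for the example.

    class LocalConsensus:
        """Models the $LC register 38 fed by the AND gate 37."""

        def __init__(self):
            self.lc = 1                  # reset value at a sync point

        def record_exit(self, exit_state):
            # AND gate 37: Q = A AND B, written back as the new local aggregate.
            self.lc &= exit_state & 1

        def reset(self):
            self.lc = 1                  # e.g. a supervisor PUT of 1

    reg = LocalConsensus()
    for state in (1, 1, 0, 1):           # four workers EXIT; one reports failure
        reg.record_exit(state)
    print(reg.lc)                        # 0, latched until the next reset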
The resetting of the aggregate in the local consensus register ($LC) 38 may be performed by the supervisor SV performing a PUT to the register address of the local consensus register ($LC) 38, using one or more general purpose instructions, in this example to put a value of 1 into the register 38. Alternatively, it is not excluded that the reset could be performed by an automatic mechanism, for example triggered by executing the SYNC instruction described later herein.
The aggregation circuitry 37, in this case the AND gate, is implemented in dedicated hardware circuitry in the execution unit of the execution stage 18, using any suitable combination of electronic components for realizing the functionality of a Boolean AND. Dedicated circuitry or hardware means circuitry having a hard-wired function, as opposed to being programmed in software using general purpose code. The updating of the local exit state is triggered by execution of the special instruction EXIT, this being one of the fundamental machine code instructions in the instruction set of the processor module 4, having the inherent functionality of aggregating the exit states. Also, the local aggregate is stored in a control register 38, i.e. a dedicated piece of storage (in some embodiments a single bit of storage) whose value can be accessed by the code running on the pipeline, but which is not usable by the load-store unit (LSU) to store general purpose data. Instead, the function of the data held in a control register is fixed, in this case fixed to the function of storing the locally aggregated exit state. Preferably the local consensus register ($LC) 38 forms one of the control registers on the processor module 4 (e.g. on the tile), whose value can be accessed by the supervisor by executing a GET instruction and can be set by executing a PUT instruction.
Note that the circuit shown in FIG. 5 is only an example. An equivalent circuit would be to replace the AND gate 37 with an OR gate and to invert the software's interpretation of the exit states 0 and 1, i.e. 0 -> true, 1 -> false (with the register 38 being reset to 0 instead of 1 at each synchronization point). Equivalently, if the AND gate is replaced by an OR gate but the interpretation of the exit states is not inverted, and neither is the reset value, then the aggregated state in $LC will record whether any (rather than all) of the worker threads exited with state 1. In other embodiments, the exit states need not be single bits. For example, the exit state of each individual worker thread may be a single bit, but the aggregated exit state $LC may comprise two bits representing a trinary state: all worker threads exited with state 1, all worker threads exited with state 0, or the worker threads' exit states were mixed. As an example of the logic for implementing this, one of the two bits encoding the trinary value may be a Boolean AND (or OR) of the individual exit states, and the other bit of the trinary value may be a Boolean OR of the individual exit states. The third encoded case, indicating that the worker threads' exit states were mixed, can then be formed as the XOR of these two bits.
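The two-bit trinary encoding described above can be sketched as follows in Python; the function name trinary_aggregate is an assumption, and the encoding (AND bit, OR bit, XOR for the mixed case) follows the paragraph above.

    from functools import reduce

    def trinary_aggregate(exit_states):
        all_one = reduce(lambda a, b: a & b, exit_states)   # AND bit
        any_one = reduce(lambda a, b: a | b, exit_states)   # OR bit
        mixed = all_one ^ any_one                           # XOR flags the mixed case
        return all_one, any_one, mixed

    print(trinary_aggregate([1, 1, 1]))   # (1, 1, 0): all exited with state 1
    print(trinary_aggregate([0, 0, 0]))   # (0, 0, 0): all exited with state 0
    print(trinary_aggregate([1, 0, 1]))   # (0, 1, 1): mixed exit states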
The exit states can be used to represent whatever the programmer wishes, but one particularly envisaged example is to use an exit state of 1 to indicate that the respective worker thread exited with a successful or true state, whilst an exit state of 0 indicates that the respective worker thread exited with an unsuccessful or false state (or vice versa if the aggregation circuitry 37 performs an OR instead of an AND and the register $LC 38 is initially reset to 0). For instance, consider an application in which each worker thread performs a computation having an associated condition, such as a condition indicating whether the error or errors in the one or more parameters of a respective node in the graph of an artificial intelligence algorithm are within an acceptable level according to a predetermined metric. In this case, an individual exit state of one logic level (e.g. 1) may be used to indicate that the condition is satisfied (e.g. the error or errors in the node's one or more parameters are within an acceptable level according to some metric); whilst an individual exit state of the opposite logic level (e.g. 0) may be used to indicate that the condition was not satisfied (e.g. the error or errors are not within an acceptable level according to the metric in question). The condition may for example be an error threshold placed on a single parameter or on each parameter, or could be a more complex function of a plurality of parameters associated with the respective computation performed by the worker thread.
In another, more complex example, the individual exit states of the worker threads and the aggregated exit state may each comprise two or more bits, which may be used, for example, to represent a degree of confidence in the workers' results. E.g. the exit state of each individual worker thread may represent a probabilistic measure of confidence in a result of the respective worker thread, and the aggregation logic 37 may be replaced by more complex circuitry for performing a probabilistic aggregation of the individual confidence levels in hardware.
Whatever meaning the programmer gives to the exit states, the supervisor thread SV can then fetch the aggregated value from the local consensus register ($LC) 38 to determine the aggregated exit state of all the worker threads that have exited since it was last reset, for example at the last synchronization point, e.g. to determine whether or not all the workers exited in a successful or true state. Depending on this aggregated value, the supervisor thread can then take a decision in accordance with the programmer's design choice. The programmer can choose to make whatever use of the locally aggregated exit state he or she wishes. For example, the supervisor thread can consult the local aggregated exit state to determine whether a certain portion of the program, made up of a certain subset of worker threads, has completed as expected or desired. If not (e.g. at least one of the workers exited in an unsuccessful or false state), it may report to a host processor, or may perform another iteration of the part of the program comprising the same worker threads; but if so (e.g. all the workers exited with a successful or true state), it may instead branch to another part of the program comprising one or more new worker threads.
Preferably, the supervisor thread should not access the value in the local consensus register ($LC) 38 until all the worker threads in question have exited, so that the value stored therein represents the correct, up-to-date aggregated state of all the desired threads. This wait may be enforced by a barrier synchronization performed by the supervisor thread to wait for all the local worker threads running concurrently (i.e. those on the same processor module 4, running through the same pipeline 13) to exit. That is, the supervisor thread resets the local consensus register ($LC) 38, launches a plurality of worker threads, and then initiates a local barrier synchronization (local to the processor module 4, local to one tile) in order to wait for all the outstanding worker threads to exit before the supervisor is allowed to proceed to fetch the aggregated exit state from the local consensus register ($LC) 38.
Referring to FIG. 6, in embodiments a SYNC (synchronization) instruction is provided in the processor's instruction set. The effect of the SYNC instruction is to cause the supervisor thread SV to wait until all currently executing workers W have exited by means of an EXIT instruction. In embodiments, the SYNC instruction takes a mode as an operand (in embodiments its only operand), the mode specifying whether the SYNC is to act only locally in relation to only those worker threads running locally on the same processor module 4, e.g. the same tile, as the supervisor of which the SYNC is part (i.e. only the threads through the same pipeline 13 of the same barrel-threaded processing unit 10), or whether instead it is to apply across multiple tiles or even across multiple chips.
SYNC mode // mode ∈ {tile, chip, zone_1, zone_2}

[0107] This will be discussed in more detail later, but for the purposes of FIG. 6 a local SYNC will be assumed (SYNC tile, i.e. a synchronization within a single tile).
The worker threads do not need to be identified as operands of the SYNC instruction, since it is implicit that the supervisor SV is then caused to wait automatically until none of the time slots S0, S1, ... of the barrel-threaded processing unit 10 is occupied by a worker thread. As shown in FIG. 6, once each thread of a current batch of workers WLn has been launched by the supervisor, the supervisor then executes a SYNC instruction. If the supervisor SV launches workers W in all the slots S0...S3 of the barrel-threaded processing unit 10 (all four in the example illustrated, but that is just one possible implementation), then the SYNC will be executed by the supervisor as soon as the first of the current batch of worker threads WLn has exited, thereby handing back control of at least one slot to the supervisor SV. Otherwise, if the workers do not take up all of the slots, the SYNC will simply be executed immediately after the last thread of the current batch WLn has been launched. Either way, the SYNC causes the supervisor SV to wait for all the other threads of the current batch of workers WLn-1 to execute an EXIT before the supervisor can proceed. Only after this does the supervisor execute a GET instruction to fetch the content of the local consensus register ($LC) 38. This waiting by the supervisor thread is imposed in hardware once the SYNC has been executed. That is, in response to the operation code of the SYNC instruction, the logic in the execution unit (EXU) of the execution stage 18 causes the fetch stage 14 and the scheduler 24 to pause the issuance of instructions of the supervisor thread until all outstanding worker threads have executed an EXIT instruction. At some point after fetching the value of the local consensus register ($LC) 38 (optionally with some other supervisor code in between), the supervisor executes a PUT instruction to reset the local consensus register ($LC) 38 (to 1 in the example illustrated).
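The reset-launch-SYNC-GET sequence just described can be modelled, loosely, with Python threads standing in for hardware time slots; the lock and thread objects are stand-ins for the hardware mechanism, not part of the disclosed design.

    import threading

    lc_lock = threading.Lock()
    lc = 1                                    # PUT $LC, 1 (reset at the sync point)

    def worker(exit_state):
        global lc
        with lc_lock:                         # EXIT exit_state: aggregate into $LC
            lc &= exit_state

    batch = [threading.Thread(target=worker, args=(s,)) for s in (1, 1, 1, 0)]
    for w in batch:
        w.start()                             # RUN: launch the batch WLn
    for w in batch:
        w.join()                              # SYNC tile: wait for every EXIT
    print("GET $LC ->", lc)                   # 0: at least one worker exited false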
As also illustrated in FIG. 6, the SYNC instruction may also be used to place synchronization barriers between different interdependent layers WL1, WL2, WL3, ... of worker threads, where one or more threads in each successive layer depend on data produced by one or more worker threads in the preceding layer. The local SYNC executed by the supervisor thread ensures that none of the worker threads in the next layer WLn+1 executes until all the worker threads in the immediately preceding layer WLn have exited (by executing an EXIT instruction).
As mentioned, in embodiments the processor module 4 may be implemented as one of an array of interconnected tiles forming a multi-tile processor, each tile being arranged as described above in relation to FIGS. 1 to 6.
This is illustrated in FIG. 7, which shows a single-chip processor 2, i.e. a single die, comprising an array 6 of multiple processor tiles 4 and an on-chip interconnect 34 connecting the tiles 4. The chip 2 may be implemented alone in its own single-chip integrated circuit package, or as one of multiple dies packaged in the same IC package. The on-chip interconnect may also be referred to herein as the exchange fabric 34, since it enables the tiles 4 to exchange data with one another. Each tile 4 comprises a respective instance of the barrel-threaded processing unit 10 and memory 11, each arranged as described above in relation to FIGS. 1 to 6. For instance, by way of illustration, the chip 2 may comprise on the order of a hundred tiles 4, or even over a thousand. For completeness, note also that an array as referred to herein does not necessarily imply any particular number of dimensions or physical layout of the tiles 4.
In embodiments, each chip 2 also comprises one or more external links 8, enabling the chip 2 to be connected to one or more other external processors on different chips (e.g. one or more other instances of the same chip 2). These external links 8 may comprise any one or more of: one or more chip-to-host links for connecting the chip 2 to a host processor, and/or one or more chip-to-chip links for connecting together one or more other instances of the chip 2 on the same IC package or card, or on different cards. In one exemplary arrangement, the chip 2 receives work from a host processor (not shown), connected to the chip via one of the chip-to-host links, in the form of input data to be processed by the chip 2. Multiple instances of the chip 2 can be connected together into cards by chip-to-chip links. Thus a host may access a computer that is architected as a single-chip processor 2 or as multiple single-chip processors 2, possibly arranged on multiple interconnected cards, depending on the workload required by the host application.
The interconnect 34 is arranged to enable the different processor tiles 4 in the array 6 to communicate with one another on the chip 2. However, just as there may be dependencies between threads on the same tile 4, there may also be dependencies between the portions of the program running on different tiles 4 in the array 6. A technique is therefore required to prevent a piece of code on one tile 4 from running ahead of data upon which it depends that is made available by another piece of code on another tile 4.
This can be achieved by implementing a bulk synchronous parallel (BSP) exchange scheme, as illustrated schematically in FIGS. 8 and 9.
According to one version of BSP, each tile 4 performs a compute phase 52 and an exchange phase 50 in an alternating cycle, separated from one another by a barrier synchronization 30 between tiles. In the case illustrated, a barrier synchronization is placed between each compute phase 52 and the following exchange phase 50. During the compute phase 52, each tile 4 performs one or more computation tasks locally on the tile, but does not communicate any results of these computations to the other tiles 4. In the exchange phase 50, each tile 4 is allowed to exchange one or more results of the preceding compute phase with one or more of the other tiles in the group, but does not perform any new computation until it has received from the other tiles 4 any data on which its task or tasks depend. Neither does it send any data to other tiles, other than that computed in the preceding compute phase. It is not excluded that other operations, such as internal control-related operations, may be performed in the exchange phase. In some embodiments, the exchange phase 50 does not include any non-time-deterministic computations, but a small number of time-deterministic computations may optionally be allowed during the exchange phase 50. Note also that a tile 4 performing computation may be allowed, during the compute phase 52, to communicate with other system resources external to the array of tiles 4 being synchronized - e.g. a network card, a disk drive, or a field-programmable gate array (FPGA) - as long as this does not involve communication with other tiles 4 within the group being synchronized. Communication external to the tile group may optionally use the BSP mechanism, but alternatively may not use BSP and may instead use another synchronization mechanism of its own.
According to the BSP principle, a barrier synchronization 30 is placed at the juncture transitioning from the compute phases 52 into the exchange phase 50, or at the juncture transitioning from the exchange phases 50 into the compute phase 52, or both. That is to say, either: (a) all tiles 4 are required to complete their respective compute phases 52 before any tile in the group is allowed to proceed to the next exchange phase 50, or (b) all tiles 4 in the group are required to complete their respective exchange phases 50 before any tile in the group is allowed to proceed to the next compute phase 52, or (c) both of these conditions are enforced. In all three variants it is the individual processors that alternate between phases, and the whole assembly that synchronizes. The sequence of exchange and compute phases may then be repeated multiple times. In BSP terminology, each repetition of exchange phase and compute phase is sometimes referred to as a super-step (note, however, that in the literature the terminology is not always used consistently: sometimes each individual exchange phase and compute phase is individually called a super-step, whereas elsewhere, as in the terminology adopted herein, the exchange and compute phases together are referred to as a super-step).
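As a loose illustration of this compute/exchange alternation, the following Python sketch uses a software barrier in place of the hardware barrier synchronization 30, and enforces both conditions (a) and (b), i.e. variant (c); names such as tile_main are assumptions for the example.

    import threading

    NUM_TILES = 4
    barrier = threading.Barrier(NUM_TILES)       # stand-in for barrier sync 30

    def tile_main(tile_id, supersteps=2):
        for step in range(supersteps):
            local_result = tile_id * step        # compute phase 52: on-tile work only
            barrier.wait()                       # all tiles finish computing
            # exchange phase 50: only previously computed results are exchanged
            barrier.wait()                       # all tiles finish exchanging

    tiles = [threading.Thread(target=tile_main, args=(t,)) for t in range(NUM_TILES)]
    for t in tiles:
        t.start()
    for t in tiles:
        t.join()
    print("completed 2 BSP super-steps on", NUM_TILES, "tiles")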
Note also that it is not excluded that multiple independent groups of tiles 4 on the same chip 2 or on different chips could each form a separate respective BSP group operating asynchronously with respect to one another, with the BSP cycle of compute, synchronize and exchange being imposed only within each given group, each group doing so independently of the other groups. That is, a multi-tile array 6 may include multiple internally synchronous groups, each operating independently and asynchronously with respect to the other such groups (discussed in more detail below). In some embodiments there is a hierarchical grouping of sync and exchange, as will be discussed in more detail below.
FIG. 9 illustrates the BSP principle as implemented amongst a group 4i, 4ii, 4iii of some or all of the tiles in the array 6, in the case which imposes: (a) a barrier synchronization between compute phase 52 and exchange phase 50 (see above). Note that in this arrangement, some tiles 4 are allowed to begin computing 52 whilst some others are still exchanging.
According to the embodiments disclosed herein, this type of BSP may be facilitated by incorporating additional, special, dedicated functionality into a machine code instruction for performing barrier synchronization, i.e. the SYNC instruction.
In embodiments, the SYNC function takes on this functionality when qualified by an inter-tile mode as an operand, e.g. the on-chip mode: SYNC chip.

[0121] This is illustrated schematically in FIG. 10. In the case where each tile 4 comprises a multi-threaded processing unit 10, each tile's compute phase 52 may in fact comprise tasks performed by multiple worker threads W on the same tile 4 (and a given compute phase 52 on a given tile 4 may comprise one or more layers WL of worker threads, which in the case of multiple layers may be separated by internal barrier synchronizations using the SYNC instruction with the local, on-tile mode as operand, as described previously). Once the supervisor thread SV on a given tile 4 has launched the last worker thread in the current BSP super-step, the supervisor on that tile 4 then executes a SYNC instruction with the inter-tile mode set as the operand: SYNC chip. If the supervisor is to launch (RUN) worker threads in all the slots of its respective processing unit 10, the SYNC chip is executed as soon as the first slot that is no longer needed to RUN any more workers in the current BSP super-step is handed back to the supervisor. For instance, this may occur after the first thread to EXIT in the last layer WL, or simply after the first worker thread to EXIT if there is only a single layer. Otherwise, if not all the slots are to be used for running workers in the current BSP super-step, the SYNC chip can be executed as soon as the last worker that needs to be RUN in the current BSP super-step has been launched. This may occur once all the workers in the last layer have been RUN, or simply once all the worker threads have been RUN if there is only one layer.
The execution unit (EXU) of the execution stage 18 is arranged so that, in response to the operation code of the SYNC instruction, when qualified by the on-chip (inter-tile) operand, it causes the supervisor thread in which the SYNC chip was executed to be paused until all the tiles 4 in the array 6 have finished running their workers. This can be used to implement a barrier to the next BSP super-step; i.e. after all the tiles 4 on the chip 2 have passed the barrier, the cross-tile program as a whole can progress to the next exchange phase 50.
FIG. 11 is a diagram illustrating the logic triggered by a SYNC chip according to the embodiments described here.
Once the supervisor has launched (RUN) all of the threads it is to launch in the current compute cycle 52, it executes a SYNC instruction with the on-chip, inter-tile operand: SYNC chip. This triggers the following functionality in dedicated synchronization logic 39 on the tile 4, and in a synchronization controller 36 implemented in the hardware interconnect 34. This functionality of both the on-tile sync logic 39 and the sync controller 36 in the interconnect 34 is implemented in dedicated hardware circuitry such that, once the SYNC chip is executed, the rest of the functionality proceeds without further instructions being executed to do so.
First, the on-tile sync logic 39 causes the instruction issue for the supervisor on the tile 4 in question to pause automatically (causes the fetch stage 14 and the scheduler 24 to suspend issuing instructions of the supervisor). Once all the outstanding worker threads on the local tile 4 have performed an EXIT, the sync logic 39 automatically sends a synchronization request, sync_req, to the synchronization controller 36 in the interconnect 34. The local tile 4 then continues to wait with supervisor instruction issue paused. A similar process is also implemented on each of the other tiles 4 in the array 6 (each comprising its own instance of the sync logic 39). Thus at some point, once all the final workers in the current compute phase 52 have EXITed on all the tiles 4 in the array 6, the synchronization controller 36 will have received a respective synchronization request (sync_req) from all the tiles 4 in the array 6. Only then, in response to receiving the sync_req from every tile 4 in the array 6 on the same chip 2, does the synchronization controller 36 send a synchronization acknowledgment signal, sync_ack, back to the sync logic 39 on each of the tiles 4. Up until this point, each of the tiles 4 has had its supervisor instruction issue paused waiting for the synchronization acknowledgment signal (sync_ack). Upon receiving the sync_ack signal, the sync logic 39 in the tile 4 automatically unpauses the supervisor instruction issue for the respective supervisor thread on that tile 4. The supervisor is then free to proceed with exchanging data with other tiles 4 via the interconnect 34 in a subsequent exchange phase 50.
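The sync_req/sync_ack handshake just described can be modelled in Python as follows, with a software SyncController standing in for the hardware synchronization controller 36; the class and method names are assumptions for the example.

    import threading

    class SyncController:
        """Software stand-in for the sync controller 36 of FIG. 11."""

        def __init__(self, num_tiles):
            self.num_tiles = num_tiles
            self.requests = 0
            self.released = False
            self.cond = threading.Condition()

        def sync_req(self):
            # Called by a tile's sync logic 39 once its last worker has EXITed;
            # the call blocks, like paused supervisor issue, until sync_ack.
            with self.cond:
                self.requests += 1
                if self.requests == self.num_tiles:
                    self.released = True
                    self.cond.notify_all()     # sync_ack to every waiting tile
                else:
                    while not self.released:
                        self.cond.wait()

    ctrl = SyncController(3)
    tiles = [threading.Thread(target=ctrl.sync_req) for _ in range(3)]
    for t in tiles:
        t.start()
    for t in tiles:
        t.join()
    print("all tiles released into the exchange phase")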
Preferably, the sync_req and sync_ack signals are transmitted and received to and from the synchronization controller, respectively, via one or more dedicated sync wires connecting each tile 4 to the synchronization controller 36 in the interconnect 34.
Furthermore, in accordance with embodiments disclosed herein, additional functionality is included in the SYNC instruction. That is, at least when executed in an inter-tile mode (e.g. SYNC chip), the SYNC instruction also causes the local exit states $LC of each of the synchronized tiles 4 to be automatically aggregated in further dedicated hardware 40 in the interconnect 34. In the embodiments shown, this logic takes the form of a multi-input AND gate (one input for each tile 4 in the array 6), e.g. formed from a chain of two-input AND gates 40i, 40ii, ... as shown by way of example in FIG. 11. This inter-tile aggregation logic 40 receives the value in the local exit state register (local consensus register) $LC 38 from each tile 4 in the array - in embodiments each a single bit - and aggregates them into a single value, e.g. an AND of all the locally aggregated exit states. Thus the logic forms a globally aggregated exit state across the threads on all the tiles 4 in the array 6.
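A minimal Python sketch of the aggregation logic 40, modelled as a chain of two-input ANDs over the per-tile $LC bits (the function name is an assumption):

    from functools import reduce

    def global_consensus(lc_bits):
        # 40i, 40ii, ...: chained two-input ANDs over every tile's $LC value
        return reduce(lambda a, b: a & b, lc_bits)

    print(global_consensus([1, 1, 1, 1]))   # 1: every tile's local aggregate true
    print(global_consensus([1, 0, 1, 1]))   # 0: one tile reported failure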
Each of the tiles 4 comprises a respective instance of a global consensus register ($GC) 42 arranged to receive and store the global exit state from the global aggregation logic 40 in the interconnect 34. In embodiments this is another of the state registers in the supervisor's context register bank CXS. In response to the synchronization request (sync_req) being received from all of the tiles 4 in the array 6, the synchronization controller 36 causes the output of the aggregation logic 40 (e.g. the output of the AND) to be stored in the global consensus register ($GC) 42 on each tile 4 (it will be appreciated that the switch shown in FIG. 11 is a schematic representation of the functionality, and that in fact the update may be implemented by any suitable digital logic). This register $GC 42 is accessible by the supervisor thread SV on the respective tile 4 once supervisor instruction issue is resumed. In embodiments, the global consensus register $GC is implemented as a control register in the control register bank, such that the supervisor thread can get the value in the global consensus register ($GC) 42 by means of a GET instruction. Note that the synchronization logic 36 waits until the sync_req has been received from all the tiles 4 before updating the value in any of the global consensus registers ($GC) 42; otherwise an incorrect value could be made accessible to a supervisor thread on a tile that has not yet completed its part of the compute phase 52 and is therefore still running.
[0129] The globally aggregated exit state $GC enables the program to determine an overall outcome of parts of the program running on multiple different tiles 4 without having to individually examine the state of each individual worker thread on each individual tile. It can be used for any purpose desired by the programmer. For instance, in the example shown in FIG. 11 where the global aggregate is a Boolean AND, any input at 0 results in an aggregate of 0, but if all the inputs are 1 then the aggregate is 1. That is, if a 1 is used to represent a true or successful outcome, this means that if any of the local exit states of any of the tiles 4 is false or unsuccessful, then the global aggregated state will also be false, or will represent an unsuccessful outcome. For example, this could be used to determine whether or not the parts of the code running on all the tiles have all satisfied a predetermined condition. Thus, the program can query a single register (in embodiments a single bit) to ask "did anything go wrong, yes or no?" or "have all the nodes in the graph reached an acceptable level of error, yes or no?", rather than having to examine the individual states of the individual worker threads on each individual tile (and again, in embodiments the supervisor is in fact not able to query the state of the workers except through the exit state registers 38, 42). In other words, the EXIT and SYNC instructions each reduce multiple individual exit states down to a single combined state.
In one example use case, the supervisor on one or more of the tiles may report to a host processor if the global aggregate indicated a false or unsuccessful outcome. As another example, the program may perform a branch decision depending on the global exit state. For example, the program examines the globally aggregated exit state $GC and, based on this, determines whether to continue looping or to branch elsewhere. While the global exit state $GC remains false or unsuccessful, the program keeps iterating the same first part of the program; but once the global exit state $GC is true or successful, the program branches to a second, different part of the program. The branch decision may be implemented individually in each supervisor thread, or by one of the supervisors taking on the role of master and instructing the other, slave supervisors on the other tiles (the master role being configured in software).
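The iterate-or-branch decision described above might be sketched as follows in Python, where read_gc is a hypothetical stand-in for a GET of the $GC register:

    def read_gc(step, converge_at=3):
        # hypothetical stand-in for "GET $GC": pretend convergence at step 3
        return 1 if step >= converge_at else 0

    step = 0
    while True:
        # ... first part of the program: run workers, SYNC chip ...
        if read_gc(step):                # global aggregate true/successful
            break                        # branch to the second part of the program
        step += 1                        # otherwise iterate the same first part
    print("branched after", step, "iterations")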
Note that the aggregation logic 40 shown in FIG. 11 is only an example. In another equivalent example, the AND may be replaced by an OR, and the interpretation of 0 and 1 may be inverted (0 -> true, 1 -> false). Equivalently, if the AND gate is replaced by an OR gate but the interpretation of the exit states is not inverted, and neither is the reset value, then the aggregated state in $GC will record whether any (rather than all) of the tiles exited with locally aggregated state 1. In another example, the global exit state $GC may comprise two bits representing a trinary state: all the tiles' locally aggregated exit states $LC had state 1, all the tiles' locally aggregated exit states $LC had state 0, or the tiles' locally aggregated exit states $LC were mixed. In another, more complex example, the local exit states of the tiles 4 and the globally aggregated exit state may each comprise two or more bits, which may be used, for example, to represent a degree of confidence in the results of the tiles 4. E.g. the locally aggregated exit state $LC of each individual tile may represent a statistical, probabilistic measure of confidence in a result of the respective tile 4, and the global aggregation logic 40 may be replaced by more complex circuitry for performing a statistical aggregation of the individual confidence levels in hardware.
As mentioned previously, in embodiments multiple instances of the chip 2 can be connected together to form an even larger array of tiles 4 spanning multiple chips 2. This is illustrated in FIG. 12. Some or all of the chips 2 may be implemented on the same IC package, or some or all of them may be implemented on different IC packages. The chips 2 are connected together by an external interconnect 72 (via the external links 8 shown in FIG. 7). This may connect chips 2 on the same IC package, on different IC packages on the same card, and/or on different IC packages on different cards. As well as providing a conduit for exchanging data between tiles 4 on different chips, the external interconnect 72 also provides hardware support for performing barrier synchronization between the tiles 4 on different chips 2 and for aggregating the local exit states of the tiles 4 on the different chips 2.
In embodiments, the SYNC instruction can take at least one further possible value of its mode operand to specify an external, i.e. inter-chip, synchronization: SYNC zone_n, where zone_n represents an external synchronization zone. The external interconnect 72 comprises hardware logic similar to that described in relation to FIG. 11, but on an external, inter-chip scale. When the SYNC instruction is executed with an external synchronization zone of two or more chips specified in its operand, this causes the logic in the external interconnect to operate in a similar manner to that described in relation to the internal interconnect 34, but across the tiles on the multiple different chips 2 in the specified synchronization zone.
That is, in response to the operation code of the SYNC instruction whose operand specifies the external synchronization, the execution stage 18 causes the synchronization level specified by the operand to be signalled to dedicated hardware synchronization logic 76 in the external interconnect 72. In response, the sync logic 76 in the external interconnect conducts the process of synchronization request (sync_req) and acknowledgment (sync_ack) to be performed only amongst all the external tiles 4, e.g. all the tiles across all the chips 2 of the system in the case of a global synchronization. That is, the sync logic 76 in the external interconnect 72 will send the synchronization acknowledgment signal (sync_ack) to the tiles 4 across the chips 2 only once a synchronization request (sync_req) has been received from all the tiles 4 on those chips. All the tiles 4 on all those chips 2 will be automatically paused until the synchronization acknowledgment (sync_ack) from the external sync logic 76 is returned.
Thus, in response to an external SYNC, the supervisor instruction issue is paused until all the tiles 4 on all the chips 2 in the external synchronization zone have completed their compute phase 52 and submitted a synchronization request. Further, the logic in the external interconnect 72 aggregates the local exit states of all those tiles 4, across the multiple chips 2 in the zone in question. Once all the tiles 4 in the external synchronization zone have made the sync request, the external interconnect 72 signals a synchronization acknowledgment back to the tiles 4 and stores the cross-chip global aggregated exit state into the global consensus registers ($GC) 42 of all the tiles 4 in question. In response to the sync acknowledgment, the tiles 4 on all the chips 2 in the zone resume instruction issue for the supervisor.
Note that in embodiments the functionality of the interconnect 72 may be implemented in the chips 2, i.e. the logic may be distributed among the chips 2 such that only wired connections between chips are required (FIGS. 11 and 12 are schematic).
All tiles 4 within the synchronization zone in question are programmed to indicate the same synchronization zone via the mode operand of their respective SYNC instructions. In embodiments, the sync logic 76 in the external interconnect 72 is arranged such that, if this is not the case due to a programming error or another error (such as a memory parity error), then some or all of the tiles 4 will not receive an acknowledgment, and therefore the system will come to a halt at the next external barrier, thus allowing a managing external CPU (e.g. the host) to intervene for debug or system recovery. Preferably, however, the compiler is arranged to ensure that the tiles in the same zone all indicate the same, correct synchronization zone at the relevant time. The sync logic may also be arranged to take other alternative or additional measures in the event of inconsistency in the modes indicated by the different SYNC instructions, e.g. raising an exception to the external CPU, and/or halting execution by some other mechanism.
As illustrated in FIG. 14, in embodiments the mode of the SYNC instruction can be used to specify one of multiple different possible external synchronization zones, for example zone_l or zone_2. In embodiments, this corresponds to different hierarchical levels. That is to say that each upper hierarchical level 92 (for example zone 2) encompasses two or more zones 91A, 91B of at least one lower hierarchical level. In some embodiments, there are only two hierarchical levels, but higher numbers of nested levels are not excluded. If the operand of the SYNC instruction is set to the lower hierarchical level of the external synchronization zone (SYNC zone_l), then the synchronization and aggregation operations described above are carried out in relation to blocks 4 on chips 2 only in the same lower level external synchronization area as the block on which the SYNC was executed. If, on the contrary, the operand of the SYNC instruction is set to the higher hierarchical level of the external synchronization zone (SYNC zone_2), then the synchronization and aggregation operations described above are performed automatically in relation to all blocks 4 on all chips 2 in the same external level synchronization area higher than the block on which the SYNC was executed.
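As a small illustration of the nesting, here is a Python sketch assuming a four-chip layout along the lines of FIG. 18; the set-based model and the zone names are illustrative only.

    ZONE_1A = {"2I", "2II"}              # a lower-level sync zone 91A
    ZONE_1B = {"2III", "2IV"}            # a lower-level sync zone 91B
    ZONE_2 = ZONE_1A | ZONE_1B           # the higher-level zone 92 encompasses both

    def participants(zone):
        # SYNC zone_n: only tiles on the chips in the named zone take part
        return sorted(zone)

    print(participants(ZONE_1A))         # ['2I', '2II']
    print(participants(ZONE_2))          # ['2I', '2II', '2III', '2IV']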
[0139] In response to the operation code of the SYNC instruction having an external synchronization zone as operand, the execution stage 18 causes the synchronization level specified by the operand to be signalled to the dedicated hardware synchronization logic 76 in the external interconnect 72. In response, the sync logic 76 in the external interconnect conducts the process of synchronization request (sync_req) and acknowledgment (sync_ack) to be performed only amongst the tiles 4 of the signalled group. That is, the sync logic 76 in the external interconnect 72 will send the synchronization acknowledgment signal (sync_ack) to the tiles in the signalled synchronization zone only once a synchronization request (sync_req) has been received from all the tiles 4 in that zone (but will not wait for any other tiles outside that zone if it is not a global sync).
Note that in other embodiments, the synchronization zones that can be specified by the mode of the SYNC instruction are not limited to being hierarchical in nature. In general, a SYNC instruction may be provided with modes corresponding to any kind of grouping. For instance, the modes may enable selection from amongst only non-hierarchical groups, or a mixture of hierarchical groupings and one or more non-hierarchical groups (where at least one group is not entirely nested within another). This advantageously enables flexibility for the programmer or compiler, with minimal code density, to select between different layouts of internally synchronous groups that are asynchronous with respect to one another.
[0141] An example mechanism for implementing the synchronization amongst the selected synchronization groups 91, 92 is illustrated in FIG. 18. As illustrated, the external sync logic 76 in the external interconnect 72 comprises a respective sync block 95 associated with each respective chip 2. Each such sync block 95 comprises respective gating logic and a respective sync aggregator. The gating logic comprises hardware circuitry that connects the chips 2 together in a daisy-chain topology for the purposes of synchronization and exit-state aggregation, and that propagates the synchronization and exit-state information in the following manner. The sync aggregator comprises hardware circuitry arranged to aggregate the synchronization requests (sync_req) and the exit states in the following manner.
The respective sync block 95 associated with each chip 2 is connected to its respective chip 2, such that it can detect the synchronization request (sync_req) raised by that chip 2 and the exit state of that chip 2, and return the synchronization acknowledgment (sync_ack) and the global exit state to the respective chip 2. The respective sync block 95 associated with each chip 2 is also connected to the sync block 95 of at least one other of the chips 2 via an external sync interface comprising a bundle of four sync wires 96, details of which will be discussed in more detail shortly. This may be part of one of the chip-to-chip links 8. In the case of a link between chips 2 on different cards, the interface 8 may for example comprise a PCI interface, and the four sync wires 96 may be implemented by re-using four wires of the PCI interface. Some of the chips' sync blocks 95 are connected to those of two adjacent chips 2, each connection via a respective instance of the four sync wires 96. This way, the chips 2 can be connected in one or more daisy chains via their sync blocks 95. This enables the sync requests, sync acknowledgments, running aggregates of exit states and global exit states to be propagated up and down the chain.
In operation, for each synchronization group 91, 92, the sync block 95 associated with one of the chips 2 in that group is set as the master for the purposes of synchronization and exit-state aggregation, the rest being slaves for this purpose. Each of the slave sync blocks 95 is configured with the direction (e.g. left or right) in which it needs to propagate sync requests, sync acknowledgments and exit states for each synchronization group 91, 92 (i.e. the direction toward the master). In embodiments these settings are configurable in software, e.g. in an initial configuration phase, after which the configuration remains set throughout the subsequent operation of the system. For instance, this may be configured by the host processor. Alternatively, it is not excluded that the configuration could be hard-wired. Either way, the different synchronization groups 91, 92 can have different masters, and in general it is possible for a given chip 2 (or rather its sync block 95) to be master of one group and not of another group of which it is a member, or to be master of multiple groups.
For instance, by way of illustration, consider the example scenario of FIG. 18. Say, by way of example, that the sync block 95 of chip 2IV is set as the master of a given synchronization group 91A. Consider now the first chip 2I in the chain of chips 2, connected via their sync blocks 95 and wires 96 ultimately to chip 2IV. When all the worker threads of the current compute phase on the first chip 2I have executed an EXIT instruction, and the supervisors on all the (participating) tiles 4 have all executed a SYNC instruction specifying the synchronization group 91A, then the first chip 2I signals its readiness for synchronization to its respective associated sync block 95. The chip 2I also outputs to its sync block 95 its chip-level aggregated exit state (the aggregate of all the exiting workers on all the participating tiles on the respective chip 2I).
In response, the sync block 95 of the first chip 2I propagates a synchronization request (sync_req) to the sync block 95 of the next chip 2II in the chain. It also propagates the exit state of the first chip 2I to the sync block 95 of that next chip 2II. The sync block 95 of that second chip 2II waits until the supervisors of its own tiles 4 have all executed a SYNC instruction specifying the synchronization group 91A, causing the second chip 2II to signal its readiness for synchronization. Only then does the second chip's sync block 95 propagate a sync request onward to the sync block 95 of the next (third) chip 2III in the chain, and it also propagates a running aggregate of the exit state of the first chip 2I with that of the second chip 2II. If the second chip 2II had become ready for synchronization before the first chip 2I, then the sync block 95 of the second chip 2II would have waited for the first chip 2I to signal a sync request before propagating the sync request onward to the sync block 95 of the third chip 2III. The sync block 95 of the third chip 2III behaves in a similar manner, this time aggregating the running aggregate exit state from the second chip 2II to obtain the next running aggregate to pass onwards, and so forth. This continues toward the master sync block, which in this example is that of chip 2IV.
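A simplified Python model of this chain aggregation, ignoring timing and modelling only the AND-accumulation toward the master (the function name is an assumption):

    def propagate_chain(chip_exit_states):
        # chip_exit_states: one aggregated exit bit per chip, ordered from the
        # far end of the chain (e.g. 2I) toward the master (e.g. 2IV)
        running = 1
        for state in chip_exit_states[:-1]:
            running &= state                 # forwarded along with each sync_req
        global_aggregate = running & chip_exit_states[-1]   # master adds its own
        return global_aggregate              # sent back down the chain with sync_ack

    print(propagate_chain([1, 1, 1, 1]))     # 1: every chip exited true
    print(propagate_chain([1, 0, 1, 1]))     # 0: the second chip reported failure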
The master's sync block 95 then determines a global aggregate of all the exit states, based on the running aggregate it receives and the exit state of its own chip 2IV. It propagates this global aggregate back out along the chain to all the chips 2, together with the synchronization acknowledgment (sync_ack). If the master is partway along the chain, as opposed to being at one end as in the above example, then the sync and exit-state information propagates in opposite directions on either side of the master, both sides toward the master. In this case, the master only issues the sync acknowledgment and the global exit state once the sync request from both sides has been received. E.g. consider the case where chip 2III is master of group 92. Further, in embodiments the sync block 95 of some of the chips 2 could connect to that of three or more other chips 2, thus creating multiple branches of chain toward the master. Each chain then behaves as described above, and the master only issues the sync acknowledgment and the global exit state once the sync request has been received from all the chains. And/or, one or more of the chips 2 could connect to an external resource such as the host processor, a network card, a storage device or an FPGA.
In embodiments, the signalling of the sync and exit-state information is implemented as follows. The bundle of four sync wires 96 between each pair of chips 2 comprises two pairs of wires, a first pair 96_0 and a second pair 96_1. Each pair comprises an instance of a sync request wire and an instance of a sync acknowledgment wire. To signal a running aggregate exit state of value 0, the sync block 95 of the sending chip 2 uses the sync request wire of the first wire pair 96_0 when signalling the sync request (sync_req); or to signal a running aggregate of value 1, the sync block 95 uses the sync request wire of the second wire pair 96_1 when signalling the sync request. To signal a global aggregate exit state of value 0, the sync block 95 of the sending chip 2 uses the sync acknowledgment wire of the first wire pair 96_0 when signalling the sync acknowledgment (sync_ack); or to signal a global aggregate of value 1, the sync block 95 uses the sync acknowledgment wire of the second wire pair 96_1 when signalling the sync acknowledgment.
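The wire selection described above can be modelled as follows in Python; the dictionary-based encoding is an illustrative model of which wire pulses, not an electrical specification.

    def signal_sync_req(running_aggregate):
        # a running aggregate of 0 pulses pair 96_0, of 1 pulses pair 96_1
        return {"wire": "96_%d.sync_req" % running_aggregate}

    def signal_sync_ack(global_aggregate):
        # the global aggregate rides back on the corresponding ack wire
        return {"wire": "96_%d.sync_ack" % global_aggregate}

    print(signal_sync_req(0))    # {'wire': '96_0.sync_req'}
    print(signal_sync_ack(1))    # {'wire': '96_1.sync_ack'}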
[0148] Note that the above is only the mechanism for propagating the sync and exit-state information. The actual data (content) is transmitted by another channel, for example as discussed later with reference to FIG. 19. Further, it will be appreciated that this is only one example implementation, and the skilled person will be able to construct other circuits for implementing the disclosed synchronization and aggregation functionality once given the specification of that functionality set out herein. For instance, the synchronization logic (95 in FIG. 18) could instead use packets carried over the interconnects 34, 72 as an alternative to dedicated wiring. E.g. the sync_req and/or the sync_ack could each be transmitted in the form of one or more packets.
As mentioned previously, not all of the tiles 4 need necessarily participate in the synchronization. In embodiments, as discussed, the group of participating tiles can be set by the mode operand of the sync instruction. However, this only allows selection of predefined groups of tiles. It is recognized herein that it would also be desirable to be able to select sync participation on a tile-by-tile basis. Therefore, in embodiments, an alternative or additional mechanism is provided for selecting which individual tiles 4 participate in the barrier synchronization.
In particular, this is achieved by providing an additional type of instruction in the processor's instruction set, to be executed by one or some of the tiles 4 in place of the SYNC instruction. This instruction may be referred to as the abstain instruction, or the SANS instruction (start automatic non-participatory sync). In embodiments, the SANS instruction is reserved for use by the supervisor thread. In embodiments it takes a single immediate operand:

SANS n_barriers

[0151] The behaviour of the SANS instruction is to cause the tile on which it is executed to abstain from the current barrier synchronization, but without holding up the other tiles that are waiting for all the tiles in the specified sync group to SYNC. In effect it says "go on without me". When the SANS instruction is executed, its operation code triggers the logic in the execution unit of the execution stage 18 to send an instance of the synchronization request signal (sync_req) to the internal and/or external sync controller 36, 76 (depending on the mode). In embodiments, the sync request generated by the SANS applies to any synchronization group 91, 92 that encompasses the tile 4 that executed the SANS. I.e. whatever synchronization group the tiles 4 in this local chip or chips are using next (they must agree on the sync group), the sync_req from those that executed the SANS instruction will always be valid.
Thus, from the perspective of the sync controller logic 36, 76 and the other tiles 4 in the sync group, the tile 4 executing the SANS instruction appears exactly as a tile 4 executing a SYNC instruction, and does not hold up the sync barrier or the sending of the sync acknowledgment signal (sync_ack) from the sync logic 36, 76. I.e. the tiles 4 executing the SANS instead of the SYNC do not hold up or stall any of the other tiles 4 involved in any synchronization group of which the tile in question is otherwise a member. Any handshake performed by a SANS is valid for all synchronization groups 91, 92.
However, unlike the SYNC instruction, the SANS instruction does not cause supervisor instruction issue to be paused awaiting the sync acknowledgment signal (sync_ack) from the sync logic 36, 76. Instead, the respective tile can simply continue uninhibited by the current barrier synchronization being conducted between the other tiles 4 that executed SYNC instructions. Thus, by mimicking a sync but not waiting, the SANS instruction allows its tile 4 to press on with processing one or more tasks whilst still allowing the other tiles 4 to sync.
The operand n_barriers specifies the number of "posted" syncs, i.e. the number of future sync points (barriers) in which the tile will not participate. Alternatively, it is not excluded that in other embodiments the SANS instruction does not take this operand, and that instead each execution of the SANS instruction causes only a one-off abstention.
By means of the SANS instruction, certain blocks 4 may be made responsible for carrying out tasks outside the direct scope of the BSP operating scheme. For example, it may be desirable to allocate to a small number of blocks 4 in a chip 2 the initiation (and processing) of data transfers to and/or from the host memory while the majority of blocks 4 are occupied with the primary computation task or tasks. In such scenarios, those blocks 4 which are not directly involved in the primary computation can declare themselves effectively disconnected from the synchronization mechanism for a period of time using the automatic non-participating synchronization (SANS) functionality. When using this functionality, a block 4 is not required to actively signal its readiness for synchronization (that is, by executing the SYNC instruction) for any synchronization zones, and in embodiments makes a zero contribution to the aggregated output state.
The SANS instruction begins or extends a period during which the block 4 on which it is executed will abstain from active participation in inter-block synchronization (or synchronization with other external resources, if they are also involved in the synchronization). During this period, this block 4 will automatically signal its readiness for synchronization, in all zones, and in embodiments will also make a zero contribution to the global aggregated consensus $GC. This period of time can be expressed as an unsigned immediate operand (n_barriers) indicating how many additional future synchronization points will be signaled automatically by this block 4. On execution of the SANS instruction, the value n_barriers specified by its operand is placed in a countdown register $ANS_DCOUNT on the respective block 4. This is a piece of architectural state used to keep track of the number of additional future sync_reqs still to be made. If the automatic non-participating synchronization mechanism is inactive at the time of execution, the first assertion of readiness (synchronization request, sync_req) will be carried out immediately. Subsequent assertions will occur in the background, once the previous synchronization has completed (that is, following the synchronization acknowledgment, sync_ack). If the automatic non-participating synchronization mechanism is currently active, the countdown register $ANS_DCOUNT will simply be updated, so that no synchronization acknowledgment signal is left unaccounted for. The automatic non-participating synchronization mechanism is implemented in dedicated hardware logic, preferably with an instance of it in each block 4, although in other embodiments it is not excluded that it could instead be implemented centrally for a group of blocks or for all the blocks.
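By way of illustration, the following minimal Python sketch models this countdown behavior. The class, the method names and the way sync_req/sync_ack are represented are illustrative assumptions, not the hardware implementation.

```python
# Illustrative sketch (not the patented hardware): models how a per-block
# $ANS_DCOUNT countdown could track automatically signaled barriers.
class BlockSyncState:
    def __init__(self):
        self.ans_dcount = 0          # models the $ANS_DCOUNT countdown register
        self.sync_req_pending = False

    def execute_sans(self, n_barriers):
        """SANS n_barriers: abstain from the next n_barriers barriers."""
        self.ans_dcount = n_barriers
        self._assert_sync_req()      # first readiness assertion is immediate

    def _assert_sync_req(self):
        if self.ans_dcount > 0:
            self.ans_dcount -= 1
            self.sync_req_pending = True   # sent to sync controller 36/76

    def on_sync_ack(self):
        """Barrier completed: while the mechanism is active, the next
        readiness assertion happens automatically in the background."""
        self.sync_req_pending = False
        self._assert_sync_req()

block = BlockSyncState()
block.execute_sans(3)            # abstain from the next three barriers
assert block.sync_req_pending    # first sync_req issued at once
block.on_sync_ack()              # second barrier signaled automatically
block.on_sync_ack()              # third barrier signaled automatically
block.on_sync_ack()
assert not block.sync_req_pending  # countdown exhausted
```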
With regard to the behavior of the output state, there are in fact a number of possibilities depending on the implementation. In embodiments, in order to obtain the globally aggregated output state, the synchronization logic 36, 76 aggregates only the local output states originating from those blocks 4 in the specified synchronization group which executed a SYNC instruction, and not from the one or ones which executed a SANS instruction (the abstaining block or blocks). As a variant, the globally aggregated output state is obtained by aggregating the local output states from all the blocks 4 in the synchronization group, both those which executed a SYNC and those which executed a SANS instruction (both the participating and the abstaining blocks 4). In the latter case, the local output state provided by the abstaining block or blocks for the global aggregation can be the actual locally aggregated output state of the work threads of that block at the time of executing the SANS instruction, just as with the SYNC instruction (see the description of the local consensus register $LC 38). As a variant, the local output state produced by the abstaining block 4 can be a default value, for example the true value (for example logic state 1) in embodiments where the output state is binary. This prevents the abstaining block 4 from interfering with the global output state in embodiments where a false local output state causes the global output state to be false.
Regarding the return of the global output state, there are again two possibilities, regardless of whether or not the abstaining block submits a local output state for producing the global aggregate, and regardless of whether that value is an actual value or a default value. That is to say that, in one implementation, the global aggregate output state produced by the synchronization logic 36, 76 located in the interconnection 34, 72 is stored only in the global consensus registers $GC 42 of the participating blocks 4 which executed a SYNC instruction, and not of the abstaining blocks 4 which instead executed a SANS instruction. In some embodiments, instead a default value is stored in the global consensus register $GC 42 of the block(s) 4 which executed a SANS instruction (the abstaining blocks). For example, this default value can be true, for example logic state 1, in the case of a binary global output state. However, in an alternative implementation, the actual global aggregate produced by the synchronization logic 36, 76 is stored in the global consensus registers $GC 42 both of the participating blocks 4 which executed SYNC instructions and of the abstaining blocks 4 which instead executed a SANS instruction. Thus all the blocks of the group can still have access to the globally aggregated output state.
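For illustration, here is a minimal sketch of the aggregation variants just described, assuming binary output states; the dict-based block records and the function name are illustrative, not the hardware register layout.

```python
# A minimal sketch of the two aggregation variants, assuming binary states.
def aggregate_global_exit(participants, abstainers,
                          include_abstainers=False, abstainer_default=True):
    # Variant 1: only blocks that executed SYNC contribute.
    states = [b["local_exit"] for b in participants]
    if include_abstainers:
        # Variant 2: abstaining blocks also contribute, either their actual
        # local aggregate or a default 'true' that cannot pull the AND low.
        states += [b.get("local_exit", abstainer_default) for b in abstainers]
    return all(states)  # binary case: the global state is the AND

participants = [{"local_exit": True}, {"local_exit": True}]
abstainers = [{}]  # abstaining block with no submitted local output state
assert aggregate_global_exit(participants, abstainers) is True
assert aggregate_global_exit(participants, abstainers,
                             include_abstainers=True) is True
```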
FIG. 13 illustrates an example of a program flow involving both internal synchronization (on the chip) and external synchronization (inter-chip). As shown, the flow includes internal exchanges 50 (of data between blocks 4 on the same chip 2) and external exchanges 50' (of data between blocks 4 on different chips 2).
As illustrated in FIG. 13, according to the present description the internal BSP super-steps (comprising the internal exchanges 50 of data between blocks 4 on the same chip 2) are kept separate from the external synchronization and exchange (comprising the external exchanges 50' of data between blocks 4 on different chips 2).
One reason for keeping the internal and external BSP separate is that, in embodiments, the exchange of data via the internal (on-chip) interconnection 34 can be made time-deterministic, as will be described in more detail in a moment with reference to Figures 16 and 17; whereas the exchange of data via the external interconnection 72 can be non-deterministic over time, for example due to a lossy physical channel which will require messages to be retransmitted. In general, an external interconnection could be made time-deterministic, but it could be difficult to do so, or could confer too little advantage over a non-deterministic interconnection, and so might not be implemented in practice.
[0162] In such embodiments, it would be desirable to keep the internal communications time-deterministic so that they can be conducted without the need for queues in the internal interconnection 34, since queues would incur an undesirable silicon footprint in the interconnection 34. However, in embodiments external communications may not be time-deterministic. If each BSP super-step were a global exchange, then temporal determinism would be contaminated by the non-time-deterministic external exchanges. This is due to the fact that once a given block or thread has performed an external exchange, the temporal determinism is lost and cannot be recovered before the next barrier synchronization.
As will be described in more detail in a moment, communication without queues can be achieved by the compiler knowing the moment at which each block 4 transmits its data, and also knowing the on-chip transit time between the transmitting block and the receiving block. Given this predetermined knowledge, the compiler can then program the receiving block to listen to the address of the transmitting block at a specific known time after the transmission of the data concerned by the transmitting block, that is to say the transmission time plus the inter-block delay. The timing of the transmission is known by the compiler because the compiler itself selects at what point in each thread to include the send instruction or instructions. Furthermore, the inter-block delay, for on-chip communications, is a fixed value that can be known for a given pair of transmitting and receiving blocks 4. The compiler can know this from a lookup table of inter-block delays for the different possible combinations of transmitting and receiving blocks. The compiler can then include the corresponding receive instruction, to listen to the transmitter's address, at the corresponding number of cycles after the transmit instruction.
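For illustration, here is a minimal sketch of this compile-time rule, assuming a hypothetical inter-block delay lookup table; the block names and delay values are example data, not real hardware figures.

```python
# A minimal sketch of the compile-time scheduling rule: the receiver listens
# to the sender's address exactly send_cycle + inter_block_delay cycles later.
INTER_BLOCK_DELAY = {("T4", "T7"): 6, ("T7", "T4"): 6}  # cycles (example data)

def schedule_receive(sender, receiver, send_cycle):
    # listen cycle = transmission time plus the fixed inter-block delay
    return send_cycle + INTER_BLOCK_DELAY[(sender, receiver)]

# If the compiler places a SEND on T4 at cycle 12, it places the
# corresponding listen/receive on T7 for cycle 18:
assert schedule_receive("T4", "T7", 12) == 18
```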
Another reason for separating BSP into internal and external stages is that global synchronization and exchange across multiple chips will be more expensive than synchronization and exchange on a chip alone, the total cost being that of the internal synchronization mentioned above plus the additional delays needed to aggregate it globally. Furthermore, although in embodiments the synchronization signaling itself does not need flow control and is therefore relatively fast, an external synchronization leads into an external exchange, and an external exchange suffers longer latency and greater uncertainty compared with an internal exchange.
First, there is usually significantly less data bandwidth available between chips than on a chip. This is because the density of inter-chip wiring is limited by the density of the package connections (balls or pads), which is significantly lower than the density of wiring available on a chip. Thus, communicating a fixed amount of data between chips will take significantly longer than on a chip, even if the transmission times are similar. Also, an external exchange is less local: the wires run further and are therefore more capacitive and more vulnerable to noise. This in turn can cause losses and hence the need for a flow control mechanism providing retransmission at the physical layer, leading to reduced overall throughput (and loss of temporal determinism - see below). Furthermore, in addition to a greater physical distance, the signaling and data transmitted between chips typically have to traverse a greater amount of logic, such as SerDes (serializers/de-serializers) and flow control mechanisms, all of which adds additional delay compared with internal communications. For example, the inventors have identified that, using conventional technologies, an external barrier synchronization process can be expected to take on the order of ten times longer than an internal synchronization, and may account for 5-10% of program execution time. Using the hardware synchronization mechanism described here, this can be reduced to around three times, but is still slower than an internal synchronization. In addition, the data exchange itself also takes longer externally, for example due to factors such as losses and retransmissions at the physical layer due to noise, and/or serialization and de-serialization between chips.
In other variants, the interconnection between chips can be lossless at the physical layer and/or the link layer, but is effectively lossy at the higher network layer due to congestion of network-layer flows between different sources and destinations, which causes queue overflow and packet loss. This is how Ethernet works, and it is envisaged that an alternative, non-time-deterministic interconnection could use Ethernet. It will also be noted that in any exchange process, whether lossless or lossy, there can in fact be non-recoverable errors (for example due to alpha radiation) which cause total failure of the exchange and which cannot be recovered by any hardware mechanism (for example at the link layer).
Both in time-deterministic cases and in non-time-deterministic cases, in embodiments the system can detect but not correct such errors. Once detected, the error can be reported to the host, whose strategy may be to checkpoint the state of the BSP application periodically and, in the event of such a fatal hardware error, to roll the application state back to the last checkpoint. By this mechanism, even lossy mechanisms used to perform data exchange can be given a lossless appearance to the user, at a certain cost in performance.
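For illustration, a minimal Python sketch of this host-side checkpoint-and-rollback strategy follows; the state dictionary, the error model (a RuntimeError standing for a reported fatal exchange error) and the checkpoint interval are all assumptions, not part of the described hardware.

```python
# Sketch of the host strategy: checkpoint periodically, roll back on error.
def run_with_checkpoints(app_state, run_superstep, checkpoint_every=100):
    checkpoint = dict(app_state)           # initial recovery point
    step = 0
    while not app_state.get("done", False):
        try:
            run_superstep(app_state)       # one BSP super-step
            step += 1
            if step % checkpoint_every == 0:
                checkpoint = dict(app_state)   # persist a recovery point
        except RuntimeError:               # non-recoverable error reported
            app_state.clear()
            app_state.update(checkpoint)   # roll back and retry from there
```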
For all the above-mentioned reasons or others, it would be desirable to separate the BSP process into time-deterministic stages and non-time-deterministic stages, so as to prevent the temporal determinism of at least certain time-deterministic exchanges within time-deterministic domains from being contaminated by non-time-deterministic exchanges between such domains.
Consequently, the program can be arranged to carry out a sequence of synchronizations, exchange phases and calculation phases comprising, in the following order: (i) a first calculation phase, then (ii) an internal barrier synchronization 30, then (iii) an internal exchange phase 50, then (iv) an external barrier synchronization 80, then (v) an external exchange phase 50'. See chip 2 in FIG. 13. The external barrier 80 is imposed after the internal exchange phase 50, so that the program does not carry out the external exchange 50' until after the internal exchange 50. It will also be noted that, as shown with regard to chip 2 in FIG. 12, optionally a calculation phase can be included between the internal exchange (iii) and the external barrier (iv).
This global sequence is imposed by the program (for example by being generated in this way by the compiler). In embodiments the program is programmed to act in this way by means of the SYNC instruction described above. The internal synchronization and exchange do not extend to any block or other entity on another chip 2. The sequence (i)-(v) (with the optional calculation phase mentioned above between iii and iv) can be repeated in a series of global iterations. Within an iteration, there can be multiple instances of the internal computation, synchronization and exchange (i)-(iii) before the external synchronization and exchange. That is, multiple instances of (i)-(iii) (retaining this order), i.e. multiple internal BSP super-steps, can be implemented before (iv)-(v), i.e. the external synchronization and exchange. It will also be noted that any of the blocks 4 can be carrying out its own instance of internal synchronization and exchange (ii)-(iii) in parallel with the other blocks 4.
Thus for each overall BSP cycle (i)-(v), it is ensured that there is at least one part of the cycle, (ii)-(iii), in which synchronization is constrained to be performed only internally, that is to say only on the chip.
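For illustration, the imposed order (i)-(v) can be sketched as follows. The phase functions are placeholder stubs standing for the behavior described above, and in the real system each domain runs (i)-(iii) concurrently with the others rather than in a sequential loop.

```python
# A minimal sketch of the imposed BSP order (i)-(v).
def compute_phase(domain): pass        # (i)   local computation only
def internal_barrier(domain): pass     # (ii)  internal barrier sync 30
def internal_exchange(domain): pass    # (iii) time-deterministic exchange 50
def external_barrier(domains): pass    # (iv)  external barrier sync 80
def external_exchange(domains): pass   # (v)   external exchange 50'

def bsp_global_cycle(domains, inner_supersteps=1):
    for _ in range(inner_supersteps):  # (i)-(iii) may repeat before (iv)-(v)
        for domain in domains:
            compute_phase(domain)
            internal_barrier(domain)
            internal_exchange(domain)
    external_barrier(domains)          # only now may domains cross-talk
    external_exchange(domains)

bsp_global_cycle(["chip0", "chip1"], inner_supersteps=3)
```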
Note that during an external exchange 50', communications are not limited to being only external. Some blocks can perform just internal exchanges, some only external exchanges, and some a mixture. However, due to the loss of temporal determinism which occurs in the external interconnection 72 in certain embodiments, in such embodiments, once a block has performed an external communication, it cannot carry out an internal communication again before the next synchronization (see below the explanation of the preferred on-chip communication mechanism, which relies on predetermined knowledge of the temporal characteristics of messages and of inter-block delays).
In some embodiments, as also shown in FIG. 13, certain blocks 4 can perform local input/output during a calculation phase; for example they can exchange data with a host.
It will also be noted that, as shown in FIG. 13, it is generally possible for any one or all of the blocks to have a null calculation phase 52 or a null exchange phase 50 in any given BSP super-step.
In embodiments, the different levels of synchronization zones 91, 92 can be used to constrain the scope of some of the external synchronization and exchange operations to only a subgroup of the chips 2 in the system, limiting the number of times the penalty of full global synchronization and exchange is incurred. That is to say that the global cycle can include: (i) a first calculation phase, then (ii) an internal barrier synchronization, then (iii) an internal exchange phase, then (iv) an external barrier synchronization 80 across the blocks of only a first, lower-level synchronization zone 91; then (v) an external exchange phase between only the chips of the first synchronization zone 91; then (vi) an external barrier synchronization across a second, higher-level synchronization zone 92; then (vii) an external exchange phase between the chips of the second-level synchronization zone 92. The external barrier leading into the second-level exchange phase is imposed after the first-level external exchange phase, so that the program does not proceed to the second-level external exchange until after the first-level exchange phase. This behavior can be programmed using the SYNC instruction qualified by the different levels of the external mode in its operand.
In embodiments, the highest hierarchical level of synchronization zone encompasses all of the blocks 4 on all the chips 2 in the matrix 6, that is to say it is used to achieve global synchronization. When multiple lower-level zones are used, BSP can be imposed internally between the groups of blocks 4 on the chip(s) 2 within each zone, but each zone can operate asynchronously with respect to the others until a global synchronization is performed.
Note: with regard to the lower-level external synchronization and exchange (iv)-(v), any of the lower-level zones 91A, 91B may be carrying out its own instance of lower-level external exchange in parallel with the other lower-level zone(s). And/or, in some cases multiple instances of (i)-(v) can be implemented before (vi)-(vii), i.e. there can be multiple instances of the lower-level external BSP super-step before the external synchronization and exchange. In addition, the scheme could be extended to three or more hierarchical levels of synchronization zones.
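For illustration, here is a minimal sketch of the two-level scheme (i)-(vii); the zone contents, step counts and stub phase functions are illustrative assumptions.

```python
# A minimal sketch of hierarchical BSP: several first-level super-steps
# may run before one second-level (global) synchronization and exchange.
def internal_superstep(zone): pass   # (i)-(iii) within each chip of the zone
def zone_barrier(zone): pass         # external barrier sync 80 over a zone
def zone_exchange(zone): pass        # external exchange within a zone

def hierarchical_bsp(zone_91A, zone_91B, level1_steps=4):
    for _ in range(level1_steps):            # first-level super-steps...
        for zone in (zone_91A, zone_91B):    # zones run independently
            internal_superstep(zone)
            zone_barrier(zone)               # (iv) first-level sync, zone 91
            zone_exchange(zone)              # (v)  exchange only within zone
    whole = zone_91A + zone_91B
    zone_barrier(whole)                      # (vi)  second-level sync, zone 92
    zone_exchange(whole)                     # (vii) global exchange

hierarchical_bsp(["chip0", "chip1"], ["chip2", "chip3"])
```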
[0177] The following describes an example of a mechanism for communicating on a chip (internal exchange) without the need for queues. Reference is made to Figure 16.
Each chip 2 includes a respective clock which controls the timing of chip activity. The clock is connected to all of the chip's circuits and components. The chip 2 also includes the time-deterministic internal interconnection or switching fabric 34, to which all the blocks and links are connected by sets of connection wires. In embodiments, the interconnection 34 may be stateless, in that it has no software-readable state. Each set of connection wires is fixed end to end. The wires are pipelined. In this embodiment, a set comprises thirty-two wires. Each set can carry a packet consisting of one or more 32-bit data, one datum being transferred per clock cycle. It will be noted here that the word "packet" denotes a set of bits representing a data item (sometimes called here a data element), perhaps with one or more valid bits. The packets do not include a header or any form of destination identifier (which would allow an intended recipient to be identified uniquely), nor do they include end-of-packet information. Instead, each packet represents a numerical input or output value of a block. Each block has its own local memory (described below). The chip has no shared memory. The switching fabric constitutes only a crossed set of connection wires and holds no state. The exchange of data between blocks on the same chip is carried out in a time-deterministic manner as described here. A pipelined connection wire comprises a series of temporary storage elements, for example latches or flip-flops, which hold the data for a clock cycle before releasing it to the next storage element. The travel time along the wire is determined by these temporary storage elements, each using up one clock cycle in a path between any two points.
Each block 4 indicates its synchronization state to the synchronization controller 36 in the interconnection 34. Once it has been established that each block 4 is ready to send data, the synchronization process 30 causes the system to enter the exchange phase 50. Note that each block receives the synchronization acknowledgment with a different but known delay. The supervisor program inserts additional delay cycles as necessary so that each block begins its exchange phase on exactly the same cycle. In this exchange phase, data values move between blocks (in fact between the block memories, in a memory-to-memory data movement). In the exchange phase, there are no calculations and therefore no concurrency hazards (or at least there are no calculations which depend on data yet to be received from another block 4). In the exchange phase, each datum moves along the connection wires on which it leaves a block, between a transmitting block and its receiving block. At each clock cycle, the datum travels a certain distance along its path (from storage element to storage element), in a pipelined manner. When a datum is sent from a block, it is not sent with a header identifying a receiving block. Instead, the receiving block knows that it is to expect a datum from a certain transmitting block at a certain time. Thus the computer described here is time-deterministic.
[0180] Each block 4 executes a portion of the program which has been allocated to it by the programmer or by a compiler function, whereby the programmer or the compiler function has knowledge of what will be transmitted by a particular block at a certain time and what must be received by a receiving block at a certain time. In order to achieve this, SEND instructions are included in the local programs executed by the processor on each block, the execution time of the SEND instruction being predetermined relative to the timing of other instructions executed on the other blocks in the computer.
Each block 4 is associated with its own multiplexer 210. Each multiplexer has at least as many inputs as there are blocks 4 on the chip, each input being connected to the switching fabric 34. The crossed wires of the switching fabric are connected to a set of data output connection wires 218 from each block (a broadcast exchange bus). To ease illustration, not all of the crossed wires are shown in FIG. 16. One set of crossed wires is referenced 140x to indicate that it is one of a number of sets of crossed wires.
[0182] When the multiplexer 210 is switched to the input referenced 220x, this connects it to the crossed wires 140x and thus to the data bus 218T of the transmitting (sending) block 4T. If the multiplexer is controlled to switch to this input at a certain time, then the data received on the data bus 230 which is connected to the crossed wires 140x will appear at the output of the multiplexer 210 at a certain time. It will arrive at the receiving block 4R after a certain delay, the delay depending on the distance between the multiplexer 210 and the receiving block 4R. Since the multiplexers tend to be arranged close to the switching fabric, the delay between the multiplexer and the block may vary depending on the location of the receiving block 4R.
To implement the switching, the local programs executed on the blocks 4 include switch control instructions (PUTi) which cause a multiplexer control signal 214 to be transmitted to control the multiplexer 210 associated with that block, so as to switch its input a certain time before the moment when a particular datum is expected to be received at the block. In the exchange phase, multiplexers are switched and packets (data) are exchanged between blocks using the switching fabric. It can be seen from this explanation that the internal interconnection 34 is stateless and needs no queues: the movement of each datum is predetermined by the particular crossed wire to which the input of each multiplexer is connected.
In the exchange phase, all the blocks 4 are authorized to communicate with all the other blocks in their synchronization group. Each block 4 has control of its own unique input multiplexer 210. Incoming traffic can thus be selected from any other block in the chip 2 (or from one of the external connection links in an external exchange). It is also possible for a multiplexer 210 to be set to receive a null input, that is to say no input, in any given exchange phase.
Each block 4 has three interfaces: an exin interface 224 which passes data from the switching fabric 34 to the block 4; an exout interface 226 which passes data from the block to the switching fabric on the broadcast exchange bus 218; and an exmux interface 228 which passes the multiplexer control signal 214 (mux-select) from the block 4 to its multiplexer 210.
In order to ensure that each individual block executes SEND instructions and switch control instructions at the appropriate times to transmit and receive the correct data, exchange scheduling requirements must be met by the programmer or compiler which allocates the individual programs to the individual blocks in the computer. This function is performed by an exchange scheduler, preferably at compile time, which needs to know the following parameters.
Parameter I: the relative SYNC acknowledgment delay of each block, RSAK (TID of transmitting block, TID of receiving block). This is a function of the block IDs (TIDs) of the transmitting and receiving blocks, held in the TILE_ID register. It is expressed as a number of cycles, always greater than or equal to 0, indicating when each block receives the synchronization acknowledgment signal from the synchronization controller 36 relative to all the other blocks. This can be calculated from the block ID, noting that the block ID indicates the particular location on the chip of that block, and therefore reflects physical distances. In other words, the synchronization acknowledgment delays are equalized. If the transmitting block 4T is closer to the synchronization controller 36 and the receiving block 4R is further away, the consequence is that the synchronization acknowledgment delay will be shorter for the transmitting block 4T than for the receiving block 4R, and vice versa. A specific value will be associated with each block for the synchronization acknowledgment delay. These values can be held, for example, in a lookup table, or can be calculated on the fly each time based on the block ID.
Parameter II: the exchange multiplexer control loop delay, MXP (TID of the receiving block). This is the number of cycles between the issue of an instruction (PUTi MUXptr) which modifies a block's input multiplexer selection and the earliest point at which the same block could issue a (hypothetical) load instruction for exchange data stored in memory as a result of the new multiplexer selection. This comprises the delay for the control signal to travel from the exmux interface 228R of the destination block 4R to its multiplexer 210R, plus the length of the line between the output of the multiplexer and the exin data input interface 224.
Parameter III: the block-to-block exchange delay, TT (TID of transmitting block, TID of receiving block). This is the number of cycles between a SEND instruction being issued on one block and the earliest point at which the receiving block could issue a (hypothetical) load instruction pointing to the sent value in its own memory. This can be calculated from the TIDs of the transmitting and receiving blocks, either by accessing a table or by calculating on the fly. This delay includes the time taken for data to travel from the transmitting block 4T, from its exout interface 226T to the switching fabric 34 along its exchange bus 218T, and then via the input multiplexer 210R at the receiving block 4R to the exin interface 224R of the receiving block.
Parameter IV: the exchange traffic memory pointer update delay, MMP(). This is the number of cycles between the issue of an instruction (PUTi MEMptr) which modifies a block's exchange input traffic memory pointer 232 and the earliest point at which that same block could issue a (hypothetical) load instruction for exchange data stored in memory as a result of the new pointer. It is a small, fixed number of cycles. The memory pointer 232 acts as a pointer into the data memory 202 and indicates where incoming data from the exin interface 224 is to be stored.
Together these parameters give the total inter-block delay between the transmission of a datum from the transmitting block 4T and the reception of that datum by the receiving block 4R. The particular exchange mechanism and the above parameters are given only by way of example. Different exchange mechanisms may differ in the exact composition of the delay, but as long as the exchange is kept time-deterministic, it can be known by the programmer or compiler and thus exchange without queues is possible.
FIG. 17 shows the example exchange timing in more detail. On the left-hand side are shown the chip clock cycles running from 0 to 30. The action on the transmitting block 4T occurs between clock cycles 0 and 9, starting with the issue of a send instruction (SEND E0). In clock cycles 10 to 24, the datum travels through the pipeline of the switching fabric 34.
Looking at the receiving block 4R in IPU clock cycle 11, it can be seen that a PUTi instruction is executed, changing the selection of the block's input multiplexer. In cycle 18, the memory pointer instruction is executed, authorizing a load instruction in clock cycle 25. On the transmitting block 4T, cycles 1 to 9 are an internal block delay between the issue of a SEND instruction and the appearance of the data on the exout interface. E1, E2 represent data from an earlier SEND instruction. In the exchange fabric 34, clock cycles 10 to 24 are labeled exchange. In each of these cycles, the data moves one step along the pipeline (between temporary storage elements). Cycles 25 to 29 on the receiving block 4R represent the delay between receiving the data at the exin interface and committing it to memory.
In simple terms, if the processor of the receiving block 4R wants to act on a datum which was the output of a process on the transmitting block 4T, then the transmitting block 4T must execute a SEND instruction at a certain time (for example clock cycle 0 in FIG. 17), and the receiving block 4R must execute a switch control instruction PUTi EXCH MXptr (as in clock cycle 11) with a certain delay relative to the execution of the SEND instruction on the transmitting block. This ensures that the data arrives at the receiving block in time to be loaded for use in a codelet running on the receiving block 4R.
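As an illustration, the following Python sketch combines Parameters II-IV into the earliest safe load cycle on the receiver. The constants are illustrative values chosen only so that the result reproduces the FIG. 17 example (SEND at cycle 0, PUTi at cycle 11, memory pointer update at cycle 18, load at cycle 25); they are not hardware-specified figures.

```python
# Illustrative parameter values only (chosen to match the FIG. 17 example).
TT = {("4T", "4R"): 25}   # Parameter III: SEND -> earliest load (cycles)
MXP = {"4R": 14}          # Parameter II:  PUTi MUXptr -> earliest load
MMP = 7                   # Parameter IV:  PUTi MEMptr -> earliest load

def earliest_load_cycle(send_cycle, puti_cycle, memptr_cycle, tx, rx):
    # The load may not precede any of the three constraints:
    return max(send_cycle + TT[(tx, rx)],    # data has traversed the fabric
               puti_cycle + MXP[rx],         # input multiplexer has settled
               memptr_cycle + MMP)           # memory pointer is in effect

# FIG. 17: SEND at cycle 0, PUTi at 11, MEMptr update at 18 -> load at 25
assert earliest_load_cycle(0, 11, 18, "4T", "4R") == 25
```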
Note that the reception process at the receiving block does not need to involve setting the memory pointer as with the PUTi MEMptr instruction. Instead, the memory pointer 232 increments automatically after each datum is received at the exin interface 224. The received data is then simply loaded into the next available memory location. However, the ability to change the memory pointer allows the receiving block to alter the memory location where the data is written. All of this can be determined by the compiler or programmer who writes the individual programs for the individual blocks so that they communicate properly. This makes the temporal characteristics of an internal exchange (the inter-block exchange on a chip) completely time-deterministic. This temporal determinism can be used by the exchange scheduler to optimize exchange sequences very strongly.
FIG. 19 illustrates an example of a mechanism for communicating off-chip (external exchange). This mechanism is non-deterministic over time. The mechanism is implemented in dedicated hardware logic in the external interconnection 72. Data is sent over the external interconnection 72 in the form of packets. Unlike packets sent over the internal interconnect, these packets have headers: since the order of transmission can change, they need the destination address to be present in the packet header. Also, in embodiments the external interconnection 72 takes the form of a network and therefore requires additional information for the purpose of routing.
At the physical layer the interconnection mechanism is lossy, but at the transaction layer the mechanism is lossless thanks to the architecture of the link layer: if a packet is not acknowledged, it will be retransmitted automatically by the hardware in the interconnection 72. The possibility of loss and retransmission at the data link layer, however, means that the delivery of data packets over the external interconnection is not time-deterministic. Furthermore, all the packets of a given exchange can arrive together or separated in time, and in any order, so the external interconnection requires flow control and the use of queues. In addition, the interconnection can use clock data recovery (CDR) technology to infer a clock from a received data stream having enough data signal transitions to maintain bit-lock. This inferred clock will have an unknown phase relationship with the transmit clock and therefore represents an additional source of non-determinism.
As illustrated, the external interconnection 72 includes an external exchange block (XB) 78. The compiler nominates one of the blocks 4 to send an external exchange request (XREQ) to the exchange block 78 (step S1). The XREQ request is a message comprising one or more command packets, indicating which of the blocks 4 have data packets (content) to send to another block or blocks 4 on another chip 2. This is illustrated schematically in Figure 19 by check marks and crosses: as an example scenario, those marked with a check mark have data packets to send externally and those marked with a cross do not. In step S2, the exchange block 78 sends an exchange-on command packet (XON) to a first one of the blocks 4 with data to send externally. This causes the first block to start sending its packets to the relevant destination via the external interconnection 72 (step S3). If at any time the XB is unable to continue sending packets into the interconnection (for example due to a previous packet loss and retransmission in the interconnection, or due to over-subscription of the external interconnection by many other XBs and blocks), the XB will send an exchange-off (XOFF) to that block before the XB's queue overflows. Once the congestion has cleared and the XB again has sufficient space in its queue, it will send an XON to the block, allowing it to continue transmitting its content. Once this block has sent its last data packet, then in step S4 the exchange block 78 sends an exchange-off command packet (XOFF) to this block, then in step S5 sends another XON to the next block 4 with data packets to send, and so on. The signaling of XON and XOFF is implemented as a hardware mechanism in dedicated hardware logic in the form of the external exchange block 78.
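For illustration, the XON/XOFF handshake can be sketched as a toy simulation; the queue limit, packet counts and the instantaneous drain model are illustrative assumptions, not the behavior of the actual exchange block 78.

```python
from collections import deque

# Toy simulation of the XON/XOFF flow control run by the exchange block.
def run_external_exchange(blocks, queue_limit=4):
    queue, log = deque(), []
    for bid, packets in blocks:             # order fixed by XREQ handling
        log.append(("XON", bid))            # steps S2/S5: allow sending
        for pkt in packets:
            if len(queue) >= queue_limit:   # back-pressure before overflow
                log.append(("XOFF", bid))
                while queue:                # model the interconnect draining
                    queue.popleft()
                log.append(("XON", bid))    # congestion cleared: resume
            queue.append((bid, pkt))
        log.append(("XOFF", bid))           # step S4: block's data done
    return log

log = run_external_exchange([("block0", range(6)), ("block3", range(2))])
assert log.count(("XON", "block0")) == 2    # throttled once mid-transfer
```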
Note that this is only one example of a flow control mechanism for communicating externally between chips. Other suitable mechanisms will be familiar to those skilled in the art. Also, the possibility of a time-deterministic and/or queue-less external interconnection is not excluded.
FIG. 15 illustrates an example of application of the processor architecture described here, namely an artificial intelligence application.
As is well known to those skilled in the art of artificial intelligence, artificial intelligence begins with a learning stage in which the artificial intelligence algorithm learns a knowledge model. The model comprises a graph of interconnected nodes (i.e. vertices) 102 and edges (i.e. links) 104. Each node 102 in the graph has one or more input edges and one or more output edges. Some of the input edges of some of the nodes 102 are the output edges of some of the other nodes, thereby connecting the nodes together to form the graph. In addition, one or more of the input edges of one or more of the nodes 102 form the inputs of the graph as a whole, and one or more of the output edges of one or more of the nodes 102 form the outputs of the graph as a whole. Sometimes a given node can even have all of these: inputs to the graph, outputs from the graph, and connections to other nodes. Each edge 104 communicates a value or more often a tensor (an n-dimensional matrix), these forming the inputs and outputs supplied to and obtained from the nodes 102 on their input and output edges respectively.
Each node 102 represents a function of its one or more inputs received on its input edge(s), the result of this function being the output(s) provided on the output edge(s). Each function is parameterized by one or more respective parameters (sometimes called weights, although they do not necessarily have to be multiplicative weights). In general, the functions represented by the different nodes 102 can take different forms of function and/or can be parameterized by different parameters.
Furthermore, each of said one or more parameters of each node's function is characterized by a respective error value. In addition, a respective condition can be associated with the error(s) in the parameter(s) of each node 102. For a node 102 representing a function parameterized by a single parameter, the condition can be a simple threshold, that is to say the condition is satisfied if the error is within the specified threshold but is not satisfied if the error is beyond the threshold. For a node 102 parameterized by more than one respective parameter, the condition for that node 102 to have reached an acceptable level of error can be more complex. For example, the condition can be satisfied only if each of the parameters of that node 102 remains below its respective threshold. In another example, a combined metric can be defined combining the errors in the different parameters of the same node 102, and the condition can be satisfied if the value of the combined metric remains below a specified threshold, but is otherwise not satisfied if the value of the combined metric is beyond the threshold (or vice versa depending on the definition of the metric). Whatever the condition, this gives a measure of whether the error in the node's parameter(s) remains below a certain level or degree of acceptability. In general, any suitable metric can be used. The condition or the metric can be the same for all the nodes, or can be different for different respective nodes.
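For illustration, here is a minimal sketch of these per-node convergence conditions, in which the thresholds and the choice of a mean as the combined metric are illustrative assumptions:

```python
# A minimal sketch of the two condition variants described above.
def node_converged(param_errors, thresholds=None, combined_threshold=None):
    if combined_threshold is not None:   # combined-metric variant (mean here)
        return sum(param_errors) / len(param_errors) <= combined_threshold
    # per-parameter variant: every error must be within its own threshold
    return all(e <= t for e, t in zip(param_errors, thresholds))

assert node_converged([0.02], thresholds=[0.05])             # single parameter
assert not node_converged([0.02, 0.09], thresholds=[0.05, 0.05])
assert node_converged([0.02, 0.09], combined_threshold=0.06)  # mean is 0.055
```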
In the learning stage, the algorithm receives experience data, that is to say multiple data points representing different possible combinations of inputs to the graph. As more experience data is received, the algorithm gradually adjusts the parameters of the various nodes 102 of the graph on the basis of the experience data so as to try to minimize the errors in the parameters. The goal is to find values of the parameters such that the output of the graph is as close as possible to a desired output for a given input. When the graph as a whole tends towards such a state, the graph is said to converge. After a suitable degree of convergence, the graph can be used to make predictions or inferences, that is to say to predict an output for a given input or to infer a cause for a given output.
The learning stage can take a number of different possible forms. For example, in a supervised approach, the input experience data takes the form of training data, that is to say inputs which correspond to known outputs. With each data point, the algorithm can adjust the parameters so that the output more closely matches the known output for the given input. In the subsequent prediction stage, the graph can then be used to map an input query to an approximate predicted output (or vice versa if an inference is being made). Other approaches are also possible. For example, in an unsupervised approach, there is no concept of a reference result per input datum, and instead the artificial intelligence algorithm is left to identify its own structure in the output data. Or, in a reinforcement approach, the algorithm tries out at least one possible output for each data point in the input experience data, and is told whether this output is positive or negative (and potentially the degree to which it is positive or negative), for example win or lose, reward or punishment, or the like. Over many trials, the algorithm can gradually adjust the parameters of the graph so as to be able to predict inputs which will lead to a positive output. The various approaches and algorithms for learning a graph are known to those skilled in the art of machine learning.
According to an example application of the techniques described here, each work thread is programmed to perform the calculations associated with a respective individual node among the nodes 102 of an artificial intelligence graph. In this case, at least some of the edges 104 between the nodes 102 correspond to exchanges of data between threads, and some may involve exchanges between blocks. In addition, the individual output states of the work threads are used by the programmer to represent whether or not the respective node 102 has satisfied its respective condition for the convergence of the parameter(s) of that node, that is to say whether the error in the parameter(s) remains within the acceptable level or region of the error space. For example, one example use of the embodiments is one in which each of the individual output states is an individual bit and the aggregate output state is an AND of the individual output states (or equivalently an OR if 0 is taken as positive); or in which the aggregate output state is a trinary value representing whether the individual output states were all true, all false, or mixed. Thus, by examining a single register value in the output state register 38, the program can determine whether the graph as a whole, or at least a sub-region of the graph, has converged to an acceptable degree.
[0207] In another variant of this, it is possible to use embodiments in which the aggregation takes the form of a statistical aggregation of individual confidence values. In this case, each individual output state represents a confidence (for example a percentage) that the parameters of the node represented by the respective thread have reached an acceptable degree of error. The aggregate output state can then be used to determine an overall confidence level indicating whether the graph, or a sub-region of the graph, has converged to an acceptable degree.
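A minimal sketch of these three aggregate forms, assuming binary individual states and a mean as the statistical aggregate (the function names and the choice of mean are illustrative):

```python
# Sketches of the aggregate forms mentioned: AND of bits, a trinary
# all-true/all-false/mixed value, and a statistical mean of confidences.
def aggregate_and(bits):                   # converged only if every node is
    return all(bits)

def aggregate_trinary(bits):
    if all(bits):
        return "all_true"
    return "all_false" if not any(bits) else "mixed"

def aggregate_confidence(confidences):     # e.g. per-node confidence values
    return sum(confidences) / len(confidences)

assert aggregate_and([1, 1, 1])
assert aggregate_trinary([1, 0, 1]) == "mixed"
assert aggregate_confidence([1.0, 0.5]) == 0.75
```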
In the case of a multi-block arrangement 6, each block executes a subgraph of the graph. Each subgraph includes a supervisor routine comprising one or more supervisor threads, and a set of work threads in which some or all of the work threads can take the form of codelets.
Note that the above embodiments have been described only by way of example.
[0210] For example, the concept of separating the internal, time-deterministic BSP phases from the external, non-time-deterministic phases is not limited to an implementation using the dedicated synchronization instruction of the embodiments described above. Although this is particularly effective, it is not excluded that the internal-external BSP principle could instead be implemented in code made up of general-purpose machine code instructions.
Also, the scope of the present description is not limited to the time-deterministic domain being on-chip or to the non-time-deterministic exchanges being specifically off-chip. It would also be possible to make the separation between the time-deterministic and non-time-deterministic spheres in other ways. For example, it is not excluded to extend the time-deterministic domain across multiple chips 2, with different multi-chip time-deterministic domains connected by a non-time-deterministic interconnection (for example, the different multi-chip time-deterministic domains being implemented on different cards or different server chassis). Or, in another example, different time-deterministic domains could be implemented on a given chip 2, with a non-time-deterministic on-chip interconnection provided between such domains.
In addition, the implementation of the time-deterministic domain is not limited to the use of a lookup table of inter-block delays. Instead, for example, an analytical formula could be used to determine the inter-block delay. Furthermore, the inter-block delay and the send and receive timings are not limited to being set by the compiler. For example, as a variant they could be set manually by the programmer.
In addition, the scope of this description is not limited to any particular reason for making the separation between the time-deterministic and non-time-deterministic spheres. As described above, there are many potential drivers for this: queuing, lossy versus lossless transmission, latency, and/or on-chip/off-chip separation. In all such cases or others, it may be desirable, at least in certain phases, to prevent non-time-deterministic exchanges from polluting the temporal determinism of a time-deterministic exchange phase. The scope of this description is not limited by any possible motivation.
In addition, the applicability of the techniques described here is not limited to the architecture described above, in which a separate context is provided for the supervisor thread, or in which the supervisor thread runs in a time slot and then relinquishes its slot to a work thread. In another arrangement, for example, the supervisor can run in its own dedicated slot. Or the internal-external BSP concept can even be used in scenarios where one, some or all of the blocks on one, some or all of the chips use non-multi-threaded execution.
Where multi-threaded blocks are used, the terms supervisor and work thread do not imply specific responsibilities unless explicitly stated, and in particular are not in themselves necessarily limited to the scheme described above in which a supervisor thread relinquishes its time slot to a work thread, and so on. In general, a work thread can designate any thread to which a computation task is allocated. The supervisor can represent any kind of supervising or coordinating thread responsible for actions such as: assigning work threads to barrel slots, and/or performing barrier synchronizations between multiple threads, and/or performing any control-flow operation (such as a branch) depending on the output of more than one thread.
When reference is made to a sequence of interleaved time slots, or the like, this does not necessarily imply that the sequence mentioned consists of all the possible or available slots. For example, the sequence in question could consist of all the possible slots or only those which are currently active. It is not necessarily excluded that there may be other potential slots which are not currently included in the scheduled sequence.
The term block as used here is not necessarily limited to a particular topography or the like, and in general can denote any modular unit of processing resources comprising a processing unit 10 and a corresponding memory 11, in a matrix of similar modules, typically on the same chip (the same die).
Furthermore, when reference is made to performing synchronization or aggregation in a group of blocks, or between a plurality of blocks or the like, this need not necessarily mean all the blocks on the chip or all the blocks in the system unless explicitly stated. For example, the SYNC and EXIT instructions could be arranged to perform the synchronization and aggregation only in relation to a certain subset of blocks 4 on a given chip and/or only a subset of chips 2 in a given system; while certain other blocks 4 on a given chip, and/or certain other chips in a given system, may not be involved in a given BSP group, and could even be used for a completely separate set of tasks unrelated to the computation being performed by the group at hand.
In addition, the synchronization schemes described here do not exclude the involvement, in embodiments, of external resources that are not multi-block processors, for example a CPU processor such as the host processor, or even one or more components that are not processors, such as one or more network cards, storage devices and/or FPGAs (for example to communicate global synchronization messages in the form of packets over a wider interconnection rather than over dedicated wires used only for synchronization purposes). For example, certain blocks may choose to engage in data transfers with an external system, these transfers forming the computational load of that block. In this case, the transfers should be completed before the next barrier. In some cases, the output state of the block may depend on the result of the communication with the external resource, and this resource may indirectly influence the output state. Alternatively or in addition, resources other than multi-block processors, for example the host or one or more FPGAs, could be incorporated into the synchronization network itself. That is to say, a synchronization signal such as a Sync_req is required from this or these additional resources for the synchronization barrier to be satisfied and for the blocks to proceed to the next exchange phase. Furthermore, in embodiments the aggregated global output state can include in the aggregation an output state of the external resource, for example coming from an FPGA.
Also, while certain modes of the SYNC instruction have been described here, the scope of the present description is more generally not limited to such modes. For example, the list of modes given previously is not necessarily exhaustive. Or, in other embodiments, the SYNC instruction may have fewer modes; for example the SYNC need not support different hierarchical levels of external synchronization, or need not distinguish between on-chip and inter-chip synchronizations (that is to say, an inter-block mode always acts in relation to all the blocks, whether on-chip or off-chip). In yet other alternative embodiments, the SYNC instruction need not take a mode as an operand at all. For example, in some embodiments separate versions of the SYNC instruction (different operation codes) may be provided for the different levels of synchronization and aggregation of output states (such as different SYNC instructions for on-block synchronization and for inter-block, on-chip synchronization). Or, in other embodiments, a dedicated SYNC instruction may be provided only for inter-block synchronizations (leaving synchronization between threads at block level, where required, to be carried out in general-purpose software).
In yet other variants, the SYNC instruction could take a larger number of possible modes to support a greater granularity or a greater range of hierarchical synchronization zones 91, 92; or simply a different set of modes to support different divisions of the system into hierarchical zones. For example, in addition to allowing selection between internal (on-chip) and external (off-chip) synchronization (or even as an alternative to this), the modes of the SYNC instruction can be arranged to recognize other, more distant physical breakpoints beyond a given chip (for example an IC package, a card, a box of cards, etc.). Or, even if no dedicated SYNC instruction is used, such divisions can be implemented by the programmer or compiler using general-purpose code. Thus, in embodiments, one of the hierarchical synchronization zones (for example one of the modes of the SYNC instruction) can consist of all the blocks on all the chips in the same IC package (but none of the blocks or chips beyond). In addition or instead, one of the hierarchical synchronization zones (for example, again, one of the modes of the SYNC instruction) can consist of all the blocks on all the chips on the same card (but none of the blocks, chips or packages beyond). In another variant or additional example, one of the hierarchical synchronization zones (for example, again, another possible mode of the SYNC instruction) could consist of all the blocks on all the chips on all the cards in the same physical box, for example the same server chassis (but none of the blocks, chips or boxes beyond). This could be advantageous since communication between different server chassis will tend to incur a greater penalty than communication just between chips (dies) located in the same chassis.
In addition, the synchronization zones are not limited to being hierarchical (that is to say nested one inside the other), and in other embodiments the selectable synchronization zones can consist of or include one or more non-hierarchical groups (all the blocks of such a group not being nested within a single other selectable group).
Other applications and variants of the described techniques may become apparent to those skilled in the art given the description provided here. The scope of the present description is not limited by the described embodiments but only by the appended claims.
Claims
1. Method of operating a system comprising multiple processor blocks divided into a plurality of domains, in which within each domain the blocks are connected to each other via a respective instance of a deterministic interconnection over time, and between the domains the blocks are connected to each other via a non-deterministic interconnection over time; the method comprising:
on each respective block of a participating group of some or all of the blocks across all of the domains, performing a calculation phase in which the respective block performs one or more respective on-block calculations, but communicates the calculation results neither to nor from any of the other blocks in the group;
in each respective domain of said one or more domains, performing a respective internal barrier synchronization to impose that all the participating blocks in the respective domain have completed the calculation phase before any of the participating blocks in the respective domain is authorized to proceed to an internal exchange phase, thereby establishing a common time reference between all the participating blocks internally within each individual domain of said one or more domains;
following the respective internal barrier synchronization, performing the internal exchange phase in each of said one or more domains, in which each participating block in the respective domain communicates one or more results of its respective calculations to and/or from one or more other blocks among the participating blocks in the same domain via the deterministic interconnection over time, but communicates calculation results neither to nor from any other of said domains;
performing an external barrier synchronization to impose that all the participating blocks of said domains have completed their internal exchange phase before any of the participating blocks is authorized to proceed to an external exchange phase, thereby establishing a common time reference between all the participating blocks across all the domains; and
following the external barrier synchronization, performing the external exchange phase in which one or more of the participating blocks communicate one or more of the calculation results with another of the domains via the non-deterministic interconnection over time.
[2" id="c-fr-0002]
2. A method according to claim 1, wherein communications over the non-time-deterministic interconnect are queued, but communications between tiles over the time-deterministic interconnect are not queued.
[3" id="c-fr-0003]
3. A method according to claim 1 or 2, wherein over the time-deterministic interconnect the communication between each pair of sending and receiving tiles is performed by:
transmitting a message from the sending tile, and controlling the receiving tile to listen to an address of the sending tile at a predetermined time interval after the transmission by the sending tile, wherein the predetermined time interval is equal to a total predetermined delay between the sending tile and the receiving tile, the time interval being set by a compiler having predetermined information on the delay.
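To illustrate the timing discipline of claim 3: on the time-deterministic interconnect there is no handshake; the compiler, knowing the total tile-to-tile delay, schedules the receiver to listen exactly when the sender's word arrives. All names below (send_word, listen_to, cycle_delay, recv_word) are hypothetical, and the delay value is arbitrary.

    enum { DELAY_TX_TO_RX = 7 };             /* compiler-known delay, cycles  */

    extern void send_word(const void *src);  /* put one word on the fabric    */
    extern void listen_to(int sender_bus);   /* point input mux at the sender */
    extern void cycle_delay(int cycles);     /* wait a fixed number of cycles */
    extern void recv_word(void *dst);        /* latch the word off the wire   */

    void sender_tile(const void *result)
    {
        send_word(result);               /* emitted at a compiled cycle T     */
    }

    void receiver_tile(void *dest, int sender_bus)
    {
        listen_to(sender_bus);           /* select the sender's address       */
        cycle_delay(DELAY_TX_TO_RX);     /* the predetermined time interval   */
        recv_word(dest);                 /* the word arrives exactly now      */
    }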
[4" id="c-fr-0004]
4. A method according to any one of the preceding claims, wherein the time-deterministic interconnect is lossless, while the non-time-deterministic interconnect is lossy at the level of a physical layer, transport layer or network layer.
[5" id="c-fr-0005]
5. A method according to any one of the preceding claims, wherein each of the domains is a different respective chip, the time-deterministic interconnect being an internal on-chip interconnect and the non-time-deterministic interconnect being an external interconnect between the chips.
[6" id="c-fr-0006]
6. A method according to any one of claims 1 to 4, wherein each of the domains comprises multiple chips, the time-deterministic interconnect being a lossless external inter-chip interconnect and the non-time-deterministic interconnect being a lossy external interconnect.
[7" id="c-fr-0007]
7. A method according to any one of the preceding claims, comprising performing a series of repeating iterations, each comprising a respective instance of the compute phase, followed by a respective instance of the internal barrier synchronization, followed by a respective instance of the internal exchange phase, followed by a respective instance of the external barrier synchronization, followed by a respective instance of the external exchange phase; wherein each successive iteration is not allowed to proceed until the external barrier synchronization of the immediately preceding iteration has been performed.
[8" id="c-fr-0008]
8. A method according to any one of the preceding claims, comprising performing a sequence of instances of the compute phase, each followed by a corresponding instance of the internal exchange phase and then a corresponding instance of the internal barrier synchronization, wherein the external barrier synchronization follows the last instance of the compute phase in said sequence.
[9" id="c-fr-0009]
9. A method according to claims 7 and 8, wherein each of one or more of the iterations comprises a respective sequence of multiple instances of the compute phase, each followed by a corresponding instance of the internal exchange phase and then a corresponding instance of the internal barrier synchronization, wherein the respective external barrier synchronization follows the last instance of the compute phase in the respective sequence.
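As a sketch of the iteration shape of claims 8 and 9 (assuming the hypothetical helpers declared in the description above): several compute/internal-exchange/internal-barrier rounds, with the external barrier and exchange only after the last round.

    void bsp_iteration(int n_inner)
    {
        for (int k = 0; k < n_inner; k++) {
            compute_phase();             /* one instance of the compute phase */
            exchange_on_chip();          /* corresponding internal exchange   */
            bsp_sync(SYNC_ZONE_CHIP);    /* corresponding internal barrier    */
        }
        bsp_sync(SYNC_ZONE_CHASSIS);     /* external barrier after the last   */
        exchange_between_chips();        /* external exchange phase           */
    }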
[10" id="c-fr-0010]
10. A method according to any one of the preceding claims, wherein each of the internal and external barrier synchronizations is performed by executing a synchronization instruction comprising an opcode and an operand, wherein the operand specifies the mode of the synchronization instruction as either internal or external, and wherein the opcode, when executed, causes hardware logic in the time-deterministic interconnect to coordinate the internal barrier synchronization when the operand specifies the internal mode, and causes hardware logic in the non-time-deterministic interconnect to coordinate the external barrier synchronization when the operand specifies the external mode.
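Claim 10's mode decode could look as follows; the operand field layout here is an assumption for illustration, not the actual instruction encoding.

    #include <stdint.h>

    #define SYNC_MODE_INTERNAL 0u            /* hypothetical operand values   */
    #define SYNC_MODE_EXTERNAL 1u

    extern void coordinate_internal_barrier(void); /* logic in on-chip fabric  */
    extern void coordinate_external_barrier(void); /* logic in external fabric */

    /* Hypothetical decode step performed when the SYNC opcode executes. */
    void execute_sync(uint32_t operand)
    {
        if ((operand & 1u) == SYNC_MODE_INTERNAL)
            coordinate_internal_barrier();
        else
            coordinate_external_barrier();
    }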
11. A method according to any one of the preceding claims, comprising selecting the participating group from a plurality of predefined zones, each zone comprising a different set or subset of the multiple tiles.
12. A method according to claim 11, wherein the zones are hierarchical, at least two lower-level zones being nested within at least one higher-level zone.
13. A method according to claim 10 and to claim 11 or 12, wherein the operand of the synchronization instruction specifies to which of a plurality of different possible variants of the external mode the external barrier synchronization applies, each variant corresponding to a different one of the zones.
[12" id="c-fr-0012]
14. A method according to claims 12 and 13, wherein the variants of the external mode specify at least to which hierarchical level of zone the external barrier synchronization applies.
[13" id="c-fr-0013]
15. A method according to claim 12 or any claim dependent thereon, wherein the external synchronization and exchange comprise: first performing a first-level external synchronization, and then an exchange constrained within a first, lower-level one of the hierarchical zones; and following the first-level synchronization and exchange, performing a second-level external synchronization and exchange across a second, higher-level one of said zones.
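Claim 15's two-level scheme, sketched with the same hypothetical zone API as above: a cheaper first-level synchronization and exchange confined to a lower-level zone, followed by a second-level synchronization and exchange across a higher-level zone.

    extern void exchange_within_package(void);  /* hypothetical helpers       */
    extern void exchange_across_chassis(void);

    void two_level_external(void)
    {
        bsp_sync(SYNC_ZONE_PACKAGE);    /* first level: one IC package        */
        exchange_within_package();      /* exchange constrained to that zone  */
        bsp_sync(SYNC_ZONE_CHASSIS);    /* second level: the whole chassis    */
        exchange_across_chassis();      /* higher-level exchange              */
    }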
[14" id="c-fr-0014]
16. A method according to any one of claims 11 to 15, wherein one of the hierarchical zones consists of all the tiles located in the same IC package but not beyond; and/or one of the hierarchical zones consists of all the tiles located on the same card but not beyond; and/or one of the hierarchical zones consists of all the tiles located in the same chassis but not beyond.
17. A method according to any one of the preceding claims, comprising executing an abstain instruction on one or some of the tiles, the opcode of the abstain instruction causing the tile or tiles on which it is executed to be excluded from the group.
[15" id="c-fr-0015]
18. A method according to any one of the preceding claims, wherein in the external exchange phase one or more of the participating tiles also communicate one or more of the computation results to a host processor via the external interconnect, the host processor being implemented on a separate host processor chip.
[16" id="c-fr-0016]
19. A method according to any one of the preceding claims, wherein in the compute phase some or all of the participating tiles each run a batch of worker threads in an interleaved manner, and the internal barrier synchronization requires that all the worker threads in all the batches have exited.
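Claim 19's precondition (every interleaved worker thread has exited before the tile joins the internal barrier) might be tracked with a simple counter; the atomic-counter scheme below is an assumption for illustration, not the patented mechanism.

    #include <stdatomic.h>

    static atomic_int live_workers;          /* workers not yet exited        */

    void launch_batch(int n_workers)
    {
        atomic_store(&live_workers, n_workers);
        /* ... start n_workers interleaved worker threads ... */
    }

    void worker_exit(void)                   /* called as each worker ends    */
    {
        atomic_fetch_sub(&live_workers, 1);
    }

    void tile_join_internal_barrier(void)
    {
        while (atomic_load(&live_workers) > 0)
            ;                            /* spin until the whole batch is out */
        /* ...then issue the internal-barrier SYNC (bsp_sync sketched above)  */
    }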
[17" id="c-fr-0017]
20. A method according to any one of the preceding claims, comprising using the system to perform a machine intelligence algorithm in which each node in a graph has one or more respective input vertices and one or more respective output vertices, the input vertices of at least some of the nodes being the output vertices of at least some others of the nodes, each node comprising a respective function relating its output vertices to its input vertices, each respective function being parameterized by one or more respective parameters, and each of the respective parameters having an associated error, such that the graph converges toward a solution as the errors in some or all of the parameters reduce;
wherein each of the tiles models a respective one or more of the nodes of the graph.
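As a sketch of the per-node data a tile might hold under claim 20 (the record layout is purely illustrative):

    /* Hypothetical record for one graph node resident on a tile: a
     * parameterised function from input vertices to output vertices, with a
     * per-parameter error term whose reduction drives convergence. */
    typedef struct {
        int    n_in, n_out, n_params;
        float *in_vals;     /* values on input vertices (others' outputs)     */
        float *out_vals;    /* values this node writes to its output vertices */
        float *params;      /* parameters of the node's function              */
        float *param_err;   /* associated error for each parameter            */
    } graph_node_t;

Each tile models one or more such nodes; edges that cross tile or chip boundaries are carried by the internal or external exchange phases respectively.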
[18" id="c-fr-0018]
21. A method according to claim 5 or 6, and according to claims 18 and 19, wherein the chips are AI accelerator chips assisting the host processor.
[19" id="c-fr-0019]
22. A computer program product embodied on computer-readable storage and comprising code arranged so as, when executed on the tiles, to perform operations in accordance with any one of claims 1 to 21.
[20" id="c-fr-0020]
23. A system comprising multiple processor tiles divided into a plurality of domains, wherein within each domain the tiles are connected to one another via a respective instance of a time-deterministic interconnect, and between domains the tiles are connected to one another via a non-time-deterministic interconnect; the system being programmed to perform the following operations:
on each respective tile of a participating group of some or all of the tiles across all of the domains, perform a compute phase in which the respective tile performs one or more respective on-tile computations, but communicates computation results neither to nor from any other tile in the group;
in each respective domain of one or more of the domains, perform a respective internal barrier synchronization to require that all the participating tiles in the respective domain have completed the compute phase before any participating tile in the respective domain is allowed to proceed to an internal exchange phase, thereby establishing a common time reference among all the participating tiles internally within each individual one of said one or more domains;
following the respective internal barrier synchronization, perform the internal exchange phase in each of said one or more domains, in which each participating tile in the respective domain communicates one or more results of its respective computations to and/or from one or more other participating tiles in the same domain via the time-deterministic interconnect, but communicates computation results neither to nor from any other of said domains;
perform an external barrier synchronization to require that all the participating tiles of said domains have completed their internal exchange phase before any participating tile is allowed to proceed to an external exchange phase, thereby establishing a common time reference among all the participating tiles across all the domains; and following the external barrier synchronization, perform the external exchange phase, in which one or more of the participating tiles communicate one or more of the computation results with another of the domains via the non-time-deterministic interconnect.